Abstract

Anomaly-based intrusion detection (AID) techniques are useful for de- 
tecting novel intrusions into computing resources. One of the most suc- 
cessful AID detectors proposed to date is stide, which is based on analysis 
of system call sequences. In this paper, we present a detailed formal 
framework to analyze, understand and improve the performance of stide 
and similar AID techniques. Several important properties of stide-like 
detectors are established through formal proofs, and validated by care- 
fully conducted experiments using test datasets. Finally, the framework 
is utilized to design two applications to improve the cost and performance 
of stide-like detectors which are based on sequence analysis. The first 
application reduces the cost of developing AID detectors by identifying 
the critical sections in the training dataset, and the second application 
identifies the intrusion context in the intrusive dataset, which helps to fine-tune
the detectors. Such fine-tuning in turn helps to improve detection
rate and reduce false alarm rate, thereby increasing the effectiveness and 
efficiency of the intrusion detectors. 

1 Introduction 

Since the concept of intrusion detection in computer systems was proposed by 
Anderson [1], many research studies have been carried out to find appropriate in- 
trusion detection techniques to protect the resources in computers or networks 
[20] [14] [18] [3]. However, the network disaster caused recently by Nimda, 
MSBlast and MSSasser highlights the shortcomings of the intrusion detection 
techniques deployed in our network infrastructures [13], and indicates that in- 
trusion detection techniques still have a long way to go before they can provide 
effective protection to computing resources. 

In general, intrusion detection techniques can be categorized into
signature-based intrusion detection (SID) and anomaly-based intrusion detection
(AID) techniques. The SID techniques build (and/or update) intrusion signature bases that
include the signature of every known intrusion. Then, the resource behavior 
that matches any intrusion signature in the bases is labeled as an intrusion. 
Obviously, previously unknown intrusions cannot be detected by this technique. 
Besides this drawback, the requirement of instant updating of the intrusion 
signature bases imposes a severe performance bottleneck on SID techniques. 

Anomaly-based intrusion detection techniques have become a focus of intense 
research as they offer a useful alternative to SID that is capable of detecting
novel intrusions. One of their implicit assumptions is that the violations or
anomalies are indications of intrusions, i.e. the anomaly is caused by an intrusion
into the resource. This assumption, though largely correct, may not always be
true as some malicious intrusions may not violate the parameters of normal 
behaviors, whereas some non-malicious activities may appear as violations [26]. 

Almost all AID techniques work as follows. First, a model of normally 
behaving users [11] [12] and/or processes [9] [18] is built. Then, an intrusion is 
detected by comparing the actual current behavior against the normal model 
and taking actions according to some predetermined security policies [3] [4] 
[21]. Although many AID techniques have been proposed to date, no single 
AID technique can effectively detect all types of intrusions into the resources 
under various scenarios [13]. More specifically, they suffer from a high false alarm
rate that tends to reduce the effectiveness of true alarms because of the base
rate fallacy [2]. In addition, the high cost of training the normal model, and the
lack of guidelines for doing so, make matters worse for AID.

In this paper, instead of proposing a new AID technique, we develop a for- 
mal framework to argue about and analyze the properties of one typical AID 
technique called stide [7] [6]. Generally speaking, to be efficient, any AID
technique must try to increase the detection rate while keeping the false
alarm rate to a minimum. For this reason, it is important to understand the
factors that suppress the detection rate and lead to false alarms. In this
framework for stide, the factors are identified as the minimum foreign sequences in
the intrusive dataset and the maximum self sequences in the test dataset, and
the relations between these factors and stide efficiency are expressed and
discussed. In addition, most works related to stide are interpreted in a logical way
under this framework, namely, mimicry attacks, information hiding techniques,
t-stide, variable-length patterns, and the locality frame scheme.

Our aim in this paper is just that, and we do so by providing a useful formal- 
ism that not only helps in our understanding of the underlying dynamics among 
various factors (e.g. the influence of the completeness of the training dataset, 
the complexity of processes etc.), but also provides a practical guideline as to 
how to develop efficient training procedures for AID detectors to make train- 
ing faster and how to identify the intrusion context in the intrusive dataset to 
study intrusion characteristics. Contrary to the general belief that more
training audit trails will lead to more efficient stide detectors, the experimental
results show that there are critical sections in the training audit trails, which
are important to stide efficiency. Our trimming scheme is to find such critical
sections in the training audit trails. Ultimately, the framework gives guidelines
for selecting stide for intrusion detection (i.e., what are the applicable scenarios
of stide). Though our discussion is based on stide, the framework provides
insights that are more generally applicable to any sequence analysis based AID 
technique. 

Related work. Sequence time-delay embedding (stide) was first proposed
by Forrest et al. [6] for privileged Unix processes. The method is an instance
of a computer immunology system to protect the computer systems using the
principles of natural immune systems. However, throughout the series of papers
by Forrest et al. [9] [22] [28], there is a "magic number 6" [19], which is
empirically determined to be the length of stide detectors needed to obtain effective
detection of all anomalies in intrusive datasets. This so-called 'Why six?' problem [24]
for stide has stimulated a lot of research [15] [25]. Finally, Tan et al. [25] [24]
showed that the correct answer to the problem lies in the fact that the lower 
bound on the stide detector length is determined by the length of the minimum 
foreign sequence(s) in the intrusive dataset. 

However, Tan et al. [25] fell short of providing a comprehensive framework
that systematically analyzes the interactions among various factors affecting
the operational limits of stide detectors. In particular, that work does not establish
the effects of incompleteness of the training dataset on the effectiveness of stide.
Though it is generally understood that a more complete set of training audit
trails leads to a higher detection rate and a lower false alarm rate, a quantitative
relationship among them is not available in the literature. In practice, even
though there exist some techniques to generate a near-complete training dataset
[5], it is worth studying the effect of the completeness of the training dataset on
the performance of stide detectors for the following reasons:

1. It is difficult to collect a complete training dataset even with the
completeness-guarantee techniques;

2. The existing completeness-guarantee techniques are only specific for the 
system-call-based events in a host. However, as a general technique, stide 
is not only applicable to system call sequences in a host, but also other 
event sequences in diverse environments, such as networks; 

3. In a converse manner, the knowledge of the precise influence of the com- 
pleteness of the training dataset on the efficiency of stide detectors will be 
useful to propose or improve completeness-guarantee techniques; 

4. The "concept drifting" problem in the normal model for anomaly-based
intrusion detection tends to make the normal model always incomplete;

5. Most important of all, the tradeoff between the completeness of the
training dataset and the efficiency of the stide detector needs to be quantified.
It is common sense that, as the training dataset approaches completeness,
more effort is needed to achieve any information gain. Furthermore,
it is possible that the efficiency loss due to the incompleteness
will be made up by modeling generalization.






Contributions of this paper. The main contributions are summarized below: 

• A formal framework is proposed to determine the operational limits for 
stide-like AID detectors. Under the framework, the other techniques re- 
lated to stide, namely, mimicry attacks, information hiding techniques, 
variable length patterns, t-stide and locality frame, are interpreted in a 
logical way. 

• Under the framework, a comprehensive solution to the 'Why six?' problem 
is achieved, which extends the one presented by Tan et al. [24] [25]. 

• The influence of the completeness of the training dataset on stide efficiency 
is evaluated. 

• A methodology is derived from the formal framework for trimming the 
training data for a specific detection performance by identifying and elim- 
inating non-critical sections since they yield no additional information 
gain. This saves both training time and space for storing training data. 

• A scheme for identifying the intrusion context is proposed, and several use- 
ful findings from the minimum foreign sequences in the intrusive dataset 
are reported, at least for AID techniques based on sequence analysis. 

The remainder of the paper is organized as follows. Section 2 gives the notations
and definitions to help the readers in understanding the rest of the paper. In 
section 3, stide is briefly introduced and expressed formally. The performance 
measures such as the effectiveness, completeness and efficiency of an anomaly- 
based intrusion detector are defined and several theorems on them are presented 
and proved in section 4. In addition, the operational limits for stide detectors are 
determined. In section 5, the influence of the completeness of training dataset 
on stide efficiency is evaluated, and then the intrusion context identification 
scheme is proposed and evaluated using a typical dataset. In the last section, 
conclusions are drawn and future work on our framework is discussed. 

2 Notations and Definitions 

2.1 Notations 

Sequences and Sequence Sets: Let $\Sigma$ denote the dataset for a process, which
consists of event logs with the identity of the associated running process. A
sequence $S$ in $\Sigma$ is an event series constituted by contiguous events in $\Sigma$ with
the same process identity, and its length is denoted as $|S|$. Specially, $\phi$ is a
sequence with length 0, and $\Sigma$ itself is a sequence as well. $SS(\Sigma, l)$ denotes
the set of all the sequences of length $l$ ($l \geq 0$) that are collected from $\Sigma$.
Thus, $SS(\Sigma, 0) = \{\phi\}$. Furthermore, $SS(\Sigma) = \bigcup_{l=0}^{+\infty} SS(\Sigma, l)$. For any subset
$SS'(\Sigma) \subseteq SS(\Sigma)$, $|SS'|_{\min}(\Sigma)$ (see footnote 1) is the minimum length of all sequences in
$SS'(\Sigma)$, and $SS'_{\min}(\Sigma)$ consists of the sequences with length $|SS'|_{\min}(\Sigma)$ in
$SS'(\Sigma)$. As a special case, $|SS'|_{\min}(\Sigma) = 0$ if $\phi \in SS'_{\min}(\Sigma)$, and $|SS|_{\min}(\Sigma) = 1$.

1 We use the notation |...| to represent the length of any member sequence in a sequence
set, instead of its size.

Example 1 Suppose $\Sigma = abc$. $ab$ is a sequence in $\Sigma$ with length 2, $SS(\Sigma, 2) = \{ab, bc\}$,
and $SS(\Sigma) = \{\phi, a, b, c, ab, bc, abc\}$. For a subset $SS'(\Sigma) = \{b, c, ab, abc\}$,
$|SS'|_{\min}(\Sigma) = 1$ and $SS'_{\min}(\Sigma) = \{b, c\}$.
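The notation above can be illustrated with a minimal Python sketch. It assumes that a dataset of a single process is written as a string with one character per event, as in Example 1, and that the empty sequence $\phi$ is the empty string; the function names `ss`, `ss_all`, `min_len` and `min_members` are hypothetical and not part of the original framework.

```python
def ss(trace: str, l: int) -> set[str]:
    """SS(Sigma, l): all contiguous sequences of length l in the trace."""
    if l == 0:
        return {""}  # the empty sequence phi
    return {trace[i:i + l] for i in range(len(trace) - l + 1)}


def ss_all(trace: str) -> set[str]:
    """SS(Sigma): union of SS(Sigma, l) over every feasible length l."""
    result: set[str] = set()
    for l in range(len(trace) + 1):
        result |= ss(trace, l)
    return result


def min_len(subset: set[str]) -> float:
    """|SS'|_min: minimum member length (+inf for an empty set)."""
    return min((len(s) for s in subset), default=float("inf"))


def min_members(subset: set[str]) -> set[str]:
    """SS'_min: the members of the subset that have the minimum length."""
    m = min_len(subset)
    return {s for s in subset if len(s) == m}


print(sorted(ss("abc", 2)))                    # sequences of length 2
print(min_members({"b", "c", "ab", "abc"}))    # shortest members of SS'
```

Python sets give the duplicate-free semantics of $SS(\Sigma, l)$ directly, which is why no explicit de-duplication step is needed.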

Set Operations: For given datasets $\Sigma_1$ and $\Sigma_2$ of a process and corresponding
sequence sets $SS(\Sigma_1, l)$ and $SS(\Sigma_2, l)$, the set operations ($\cup$, $\cap$, $-$) are defined
as follows ($l \geq 0$):

(1) $SS(\Sigma_1, l) \cup SS(\Sigma_2, l) = \{S \mid (S \in SS(\Sigma_1, l)) \vee (S \in SS(\Sigma_2, l))\}$

(2) $SS(\Sigma_1, l) \cap SS(\Sigma_2, l) = \{S \mid (S \in SS(\Sigma_1, l)) \wedge (S \in SS(\Sigma_2, l))\}$

(3) $SS(\Sigma_1, l) - SS(\Sigma_2, l) = \{S \mid (S \in SS(\Sigma_1, l)) \wedge (S \notin SS(\Sigma_2, l))\}$

In addition, $\Sigma_1 \oplus \Sigma_2$ is a special concatenation of the datasets $\Sigma_1$ and $\Sigma_2$, such
that there is no sequence in $SS(\Sigma_1 \oplus \Sigma_2, l)$ in which some events belong to $\Sigma_1$
and other events belong to $\Sigma_2$. This is because the process identity in $\Sigma_1$ is
different from that in $\Sigma_2$. Therefore, $SS(\Sigma_1 \oplus \Sigma_2, l) = SS(\Sigma_1, l) \cup SS(\Sigma_2, l)$.

Example 2 Suppose that $\Sigma_1 = abc$ and $\Sigma_2 = ab$. $SS(\Sigma_1, 2) = \{ab, bc\}$, and
$SS(\Sigma_2, 2) = \{ab\}$. Thus, the set operations give $SS(\Sigma_1, 2) \cup SS(\Sigma_2, 2) = \{ab, bc\}$,
$SS(\Sigma_1, 2) \cap SS(\Sigma_2, 2) = \{ab\}$, and $SS(\Sigma_1, 2) - SS(\Sigma_2, 2) = \{bc\}$.
$\Sigma_1 \oplus \Sigma_2 = abc;ab$.
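As a rough sketch of these operations, the following Python fragment maps them onto built-in set operators and models the special concatenation as a list of separate traces, so that no window can span two process identities. The list-of-strings representation and the name `ss_concat` are assumptions made for illustration only.

```python
def ss(trace: str, l: int) -> set[str]:
    """SS(Sigma, l): all contiguous sequences of length l in the trace."""
    if l == 0:
        return {""}
    return {trace[i:i + l] for i in range(len(trace) - l + 1)}


def ss_concat(traces: list[str], l: int) -> set[str]:
    """SS of the special concatenation: windows never cross a trace boundary."""
    result: set[str] = set()
    for t in traces:
        result |= ss(t, l)
    return result


s1, s2 = "abc", "ab"
union = ss(s1, 2) | ss(s2, 2)         # operation (1)
intersection = ss(s1, 2) & ss(s2, 2)  # operation (2)
difference = ss(s1, 2) - ss(s2, 2)    # operation (3)

# Plain string concatenation would create the spurious window "ca";
# the special concatenation does not.
naive = ss(s1 + s2, 2)
special = ss_concat([s1, s2], 2)
print(sorted(naive), sorted(special))
```

Keeping traces separate is the design choice that makes $SS(\Sigma_1 \oplus \Sigma_2, l) = SS(\Sigma_1, l) \cup SS(\Sigma_2, l)$ hold by construction.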

Supersequence and Subsequence: If $S_{sub}$ is a contiguous subsequence of $S$
and $|S| - |S_{sub}| = k$, then $S_{sub}$ is said to be a $k$-order subsequence of $S$, and
denoted as $S_{sub} \preceq_k S$. Similarly, $S_{sup} \succeq_k S$ denotes that $S_{sup}$ is a $k$-order
supersequence in which $S$ is a contiguous subsequence, and $|S_{sup}| - |S| = k$. It
is worth noting that $\phi \preceq_{|S|} S$, and $S \succeq_{|S|} \phi$. For example, $ab \preceq_1 abc$, $a \preceq_2 abc$
and $ab \succeq_1 a$. In addition, the terms subsequence and supersequence will always
imply contiguity in this paper, such that $ac \not\preceq_1 abc$.
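The contiguity requirement can be made concrete with a short Python sketch, again assuming one character per event. The helper name `sub_order` is hypothetical; it relies on the fact that Python's `in` operator on strings tests for a contiguous match.

```python
from typing import Optional


def sub_order(sub: str, seq: str) -> Optional[int]:
    """Return k if sub is a k-order (contiguous) subsequence of seq, else None."""
    if sub in seq:  # substring containment is exactly contiguous matching
        return len(seq) - len(sub)
    return None


print(sub_order("ab", "abc"))  # ab is a 1-order subsequence of abc
print(sub_order("ac", "abc"))  # ac is not contiguous in abc
```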

2.2 Definitions 

Central to our framework are the twin concepts of the minimum foreign sequence 
- MFS, and the maximum self sequence - MSS. Their definitions, expressions 
and relation are given below. 

2.2.1 Foreign sequences and self sequences 

Let $\Sigma_{ref}$ be the reference dataset, and $\Sigma_{tgt}$ be the target dataset. For any
sequence $S \in SS(\Sigma_{tgt})$, if $S$ is also in $SS(\Sigma_{ref})$, $S$ will be called a self sequence;
otherwise, it is a foreign sequence to $\Sigma_{ref}$. Furthermore, $FRGN(\Sigma_{tgt}|\Sigma_{ref})$ is
defined as the set of foreign sequences of $\Sigma_{tgt}$ w.r.t. $\Sigma_{ref}$. Similarly, the set of
self sequences is defined as $SELF(\Sigma_{tgt}|\Sigma_{ref})$. Mathematically,

$$FRGN(\Sigma_{tgt}|\Sigma_{ref}) = \bigcup_{l=0}^{+\infty} \left( SS(\Sigma_{tgt}, l) - SS(\Sigma_{ref}, l) \right) \qquad (1)$$
$$SELF(\Sigma_{tgt}|\Sigma_{ref}) = \bigcup_{l=0}^{+\infty} \left( SS(\Sigma_{tgt}, l) \cap SS(\Sigma_{ref}, l) \right) \qquad (2)$$

Thus, $FRGN(\Sigma_{tgt}|\Sigma_{ref}) \cup SELF(\Sigma_{tgt}|\Sigma_{ref}) = SS(\Sigma_{tgt})$.

A sequence $S$ in $FRGN(\Sigma_{tgt}|\Sigma_{ref})$ will be called a minimum foreign sequence
(MFS) [25] if none of its subsequences is in $FRGN(\Sigma_{tgt}|\Sigma_{ref})$, i.e. all
of its subsequences are in $SELF(\Sigma_{tgt}|\Sigma_{ref})$. The set of all minimum foreign
sequences is denoted as $MFS(\Sigma_{tgt}|\Sigma_{ref})$. On the other hand, for any sequence
$S$ in $SELF(\Sigma_{tgt}|\Sigma_{ref})$, if there exists one 1-order supersequence that is not
included in $SELF(\Sigma_{tgt}|\Sigma_{ref})$ (i.e., it is in $FRGN(\Sigma_{tgt}|\Sigma_{ref})$), then it will be
called a maximum self sequence (MSS). The set of all maximum self sequences
is denoted as $MSS(\Sigma_{tgt}|\Sigma_{ref})$. Formally, they can be expressed as:

$$MFS(\Sigma_{tgt}|\Sigma_{ref}) = \{S \mid \forall S (S \in FRGN(\Sigma_{tgt}|\Sigma_{ref})) \wedge (\forall S' \forall k (S' \in SS(\Sigma_{tgt})) \wedge (S' \preceq_k S \Rightarrow S' \notin FRGN(\Sigma_{tgt}|\Sigma_{ref})))\} \qquad (3)$$
$$MSS(\Sigma_{tgt}|\Sigma_{ref}) = \{S \mid \forall S (S \in SELF(\Sigma_{tgt}|\Sigma_{ref})) \wedge (\exists S' (S' \in SS(\Sigma_{tgt})) \wedge (S' \succeq_1 S) \wedge (S' \notin SELF(\Sigma_{tgt}|\Sigma_{ref})))\} \qquad (4)$$

From these definitions, $MFS(\Sigma_{tgt}|\Sigma_{ref}) \subseteq SS(\Sigma_{tgt})$ and $MSS(\Sigma_{tgt}|\Sigma_{ref}) \subseteq SS(\Sigma_{tgt})$.
Furthermore, based on the above notations, $|MFS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) \geq 1$ and
$|MSS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) \geq 0$. Specially, if $MFS(\Sigma_{tgt}|\Sigma_{ref}) = \emptyset$, $|MFS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) = +\infty$.
The same property can be applied to $MSS(\Sigma_{tgt}|\Sigma_{ref})$.

Example 3 Suppose that $\Sigma_{ref} = abc$, $\Sigma_{tgt} = abaa$. The sequence sets of these
two datasets are: $SS(\Sigma_{ref}) = \{\phi, a, b, c, ab, bc, abc\}$ and
$SS(\Sigma_{tgt}) = \{\phi, a, b, ab, ba, aa, aba, baa, abaa\}$. Next,

$FRGN(\Sigma_{tgt}|\Sigma_{ref}) = \{ba, aa, aba, baa, abaa\}$
$SELF(\Sigma_{tgt}|\Sigma_{ref}) = \{\phi, a, b, ab\}$

Finally, we can deduce:

$MFS(\Sigma_{tgt}|\Sigma_{ref}) = \{ba, aa\}$
$MSS(\Sigma_{tgt}|\Sigma_{ref}) = \{a, b, ab\}$
$MFS_{\min}(\Sigma_{tgt}|\Sigma_{ref}) = \{ba, aa\}$
$MSS_{\min}(\Sigma_{tgt}|\Sigma_{ref}) = \{a, b\}$
$|MFS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) = 2$
$|MSS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) = 1$

2.2.2 Relation between MFS and MSS

One relationship between MFSs and MSSs of two datasets $\Sigma_{ref}$ and $\Sigma_{tgt}$ is
given by the following theorem (see footnote 2).

Theorem 1 For two datasets $\Sigma_{ref}$ and $\Sigma_{tgt}$ of a process, the following relation
holds.

$$|MSS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) = |MFS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) - 1 \qquad (5)$$

Example 4 In Example 3, it is obvious that
$|MSS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) = |MFS|_{\min}(\Sigma_{tgt}|\Sigma_{ref}) - 1 = 1$.
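The definitions of FRGN, SELF, MFS and MSS, and the relation of Theorem 1, can be checked on small inputs with the following Python sketch. It assumes single-character events and reads "subsequence" in the MFS definition as a proper contiguous subsequence; all function names are hypothetical, and the randomized check is only a sanity test on toy strings, not a substitute for the proof in the technical report.

```python
import random


def ss_all(trace: str) -> set[str]:
    """SS(Sigma): every contiguous sequence of the trace, including phi."""
    n = len(trace)
    return {trace[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def frgn(tgt: str, ref: str) -> set[str]:
    return ss_all(tgt) - ss_all(ref)


def self_set(tgt: str, ref: str) -> set[str]:
    return ss_all(tgt) & ss_all(ref)


def mfs(tgt: str, ref: str) -> set[str]:
    """Foreign sequences whose proper contiguous subsequences are all self."""
    foreign = frgn(tgt, ref)
    # Checking the two 1-order subsequences suffices: a shorter foreign
    # subsequence would make at least one of them foreign as well.
    return {s for s in foreign if s[1:] not in foreign and s[:-1] not in foreign}


def mss(tgt: str, ref: str) -> set[str]:
    """Self sequences with a foreign 1-order supersequence in the target."""
    foreign = frgn(tgt, ref)
    return {s for f in foreign for s in (f[1:], f[:-1]) if s not in foreign}


def min_len(seqs: set[str]) -> float:
    return min((len(s) for s in seqs), default=float("inf"))


def theorem1_holds(tgt: str, ref: str) -> bool:
    return min_len(mss(tgt, ref)) == min_len(mfs(tgt, ref)) - 1


def random_check(trials: int = 200, seed: int = 0) -> bool:
    """Try Theorem 1 on random toy traces over the alphabet {a, b, c}."""
    rng = random.Random(seed)
    for _ in range(trials):
        tgt = "".join(rng.choice("abc") for _ in range(rng.randint(0, 8)))
        ref = "".join(rng.choice("abc") for _ in range(rng.randint(0, 8)))
        if not theorem1_holds(tgt, ref):
            return False
    return True


print(sorted(mfs("abaa", "abc")), sorted(mss("abaa", "abc")))
```

Enumerating all contiguous sequences is quadratic in the trace length, which is acceptable for toy examples but would need a smarter index for real audit trails.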

3 A formal description of stide 

In the experimental setup for stide [6] [28], there are two datasets for every
process, the normal dataset $\Sigma_{nml}$ and the intrusive dataset $\Sigma_{int}$, which are
defined below.

Definition 1 (Normal Dataset) The normal dataset is a dataset $\Sigma_{nml}$ that
is utilized to train the normal model of a process for stide, and it MUST be
collected in the normal run of the process without any intrusion.

Definition 2 (Intrusive Dataset) The intrusive dataset $\Sigma_{int}$ is a dataset that
is collected when one or more intrusions were occurring during the runs of a
process.

In terms of these two datasets from the same process, stide can be formally
described as follows. Let $\omega$ ($\geq 1$) denote the size of the detector window. In the
modeling phase, the normal model of the process is obtained as $SS(\Sigma_{nml}, \omega)$.
Then, in the detecting phase, the foreign sequences in the intrusive dataset $\Sigma_{int}$,
$FS(\Sigma_{int}|\Sigma_{nml}, \omega)$, are enumerated:

$$FS(\Sigma_{int}|\Sigma_{nml}, \omega) = SS(\Sigma_{int}, \omega) - SS(\Sigma_{nml}, \omega)$$

If $FS(\Sigma_{int}|\Sigma_{nml}, \omega) \neq \emptyset$, the intrusion(s) in the intrusive dataset $\Sigma_{int}$ can be
detected with the detector length $\omega$ [9] [10] [19] [25] [28] (see footnote 3). It is evident that
the sequence set $FS(\Sigma_{int}|\Sigma_{nml}, \omega)$ is strongly related to $MFS(\Sigma_{int}|\Sigma_{nml})$ via
$|MFS|_{\min}(\Sigma_{int}|\Sigma_{nml})$:

$$|MFS|_{\min}(\Sigma_{int}|\Sigma_{nml}) \leq \omega \Leftrightarrow FS(\Sigma_{int}|\Sigma_{nml}, \omega) \neq \emptyset \qquad (6)$$
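The two phases of stide described above can be sketched in a few lines of Python. This is a minimal illustration under the assumption that each trace is a string of single-character events; the function names `windows`, `train` and `detect` and the toy traces are hypothetical and do not come from the original stide implementation.

```python
def windows(trace: str, w: int) -> list[str]:
    """All length-w windows of a trace, in order of occurrence."""
    return [trace[i:i + w] for i in range(len(trace) - w + 1)]


def train(normal_traces: list[str], w: int) -> set[str]:
    """Modeling phase: the normal model SS(Sigma_nml, w)."""
    model: set[str] = set()
    for trace in normal_traces:
        model.update(windows(trace, w))
    return model


def detect(model: set[str], trace: str, w: int) -> set[str]:
    """Detecting phase: FS = SS(Sigma_int, w) - SS(Sigma_nml, w)."""
    return set(windows(trace, w)) - model


normal = ["aba"]       # toy normal dataset
intrusive = "ababa"    # toy intrusive dataset
for w in (1, 2, 3):
    foreign = detect(train(normal, w), intrusive, w)
    print(w, sorted(foreign))  # an intrusion is flagged once this set is non-empty
```

In this toy case the shortest foreign sequence has length 3, so windows of size 1 and 2 see nothing, which is the behavior that relation (6) expresses.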

2 To save space, all proofs of the theorems in this paper are provided in our (extended)
technical report [16] at http://www.cais.ntu.edu.sg/home/technical_reports_2004.jsp.

3 It is notable that most of these research studies only apply stide to system-call based
sequences in a host, as does the original proposal for stide [9]. However, in principle, stide
is applicable to other environments as well. Therefore, our formal framework is not
specific to any environment, which is also one of our objectives in formalizing the stide
technique.

In its formal proposal [9], a Locality Frame Count (LFC) function is applied
to smooth the noise, or to filter the false alarms in the process, by summing
up the number of foreign sequences found within the span of a locality frame.
However, the LFC function does not add to or compensate for the detecting
ability, or the failure/shortcoming, of the stide detector, and it will not be used in
our framework. As an application of our framework, it will be interpreted later.
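For readers unfamiliar with the LFC idea, the following Python sketch shows one plausible reading of the one-sentence description above: count the foreign windows among the most recent `frame` windows and raise an alarm when the count reaches `threshold`. Both parameters, their default values, and the function name `lfc_alarms` are assumptions made for illustration; the source does not specify them.

```python
from collections import deque


def lfc_alarms(model: set[str], trace: str, w: int,
               frame: int = 20, threshold: int = 1) -> list[int]:
    """Return the window indices at which the locality frame count >= threshold."""
    recent: deque[int] = deque(maxlen=frame)  # 1 = foreign window, 0 = self window
    alarms = []
    for i in range(len(trace) - w + 1):
        recent.append(0 if trace[i:i + w] in model else 1)
        if sum(recent) >= threshold:
            alarms.append(i)
    return alarms


model = {"ab", "ba"}  # toy normal model for w = 2
print(lfc_alarms(model, "abacab", 2, frame=3, threshold=2))
```

Note that the count can only be positive if at least one window is foreign, which is consistent with the statement that the LFC function does not add to the detecting ability of stide.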

In practice, even though the underlying principle is very simple, stide can
detect most of the intrusions into the processes in the datasets from UNM [8].
For this reason, it is accepted as a typical and effective anomaly-based intrusion 
detector in many research studies. 

4 A formal framework for stide 

In general, the efficiency of an intrusion detection technique is determined by 
both false positives [2] and false negatives. The incompleteness of the normal 
model is well-known as the main cause for the false positives in an AID detector 
[17] [2]. However, in most of the research studies on stide [6] [25], the complete- 
ness is not adequately considered when evaluating the efficiency of stide detec- 
tors. In other words, there is an implicit assumption that the normal dataset is 
complete in the sense that it includes all the normal behaviors of a process. As 
a result, the issue of false positives has been completely ignored. However, as 
indicated in the first section, such completeness of the normal dataset is difficult 
to verify, and there is no effective method to guarantee it. 

In our framework, the implicit assumption about the completeness of the
normal dataset is discarded, and the normal dataset is regarded as the training
dataset $\Sigma_{trn}$ to build the known normal model of the resource. At the same time,
a test dataset $\Sigma_{tst}$ is introduced to evaluate the completeness of the training
dataset $\Sigma_{trn}$. The function of the test dataset is to evaluate the ability of
the detector to correctly identify normal data as such without generating false
positives. Thus, the test dataset must be collected during a normal run of
a process without any intrusion as well. To some extent, our methodology
corresponds to the actual scenarios where it is difficult to collect all the normal
behaviors of a computing resource, and there are always false positives when the
normal behaviors of the process are examined by an AID detector. In addition,
without loss of generality, we assume that the audit trails in $\Sigma_{int}$ are caused by
only one intrusion.

4.1 A critical look at stide performance 

In our formal framework for stide, with the detector window size $\omega$, the normal
model $SS(\Sigma_{trn}, \omega)$ is first gleaned from $\Sigma_{trn}$. Based on its detection results
on $\Sigma_{tst}$ and $\Sigma_{int}$, all the sequences are classified as follows. The outcome of
a detection process can be divided into four categories depending on the true
nature of the data and the correctness of the detection result. These are shown in
Table 1. Therefore, according to whether a sequence matches the normal model
$SS(\Sigma_{trn}, \omega)$, $SS(\Sigma_{tst}, \omega)$ can be split into two subsets: the False Positive Sequence
Set (denoted as $FPSS(\Sigma_{tst}|\Sigma_{trn}, \omega)$) and the True Negative Sequence Set (denoted
as $TNSS(\Sigma_{tst}|\Sigma_{trn}, \omega)$). Similarly, depending on the detection outcome, the
intrusive sequence set $SS(\Sigma_{int}, \omega)$ can be split into two subsets: the False Negative
Sequence Set (denoted as $FNSS(\Sigma_{int}|\Sigma_{trn}, \omega)$) and the True Positive Sequence Set
(denoted as $TPSS(\Sigma_{int}|\Sigma_{trn}, \omega)$).

Table 1: Four detection scenarios.

            INTRUSIVE SEQUENCE   NORMAL SEQUENCE
ALARM       True Positive        False Positive
NON-ALARM   False Negative       True Negative

Using our earlier notations, we can write the following definitions of the
above four sequence subsets:

$$FPSS(\Sigma_{tst}|\Sigma_{trn}, \omega) = SS(\Sigma_{tst}, \omega) - SS(\Sigma_{trn}, \omega)$$
$$TNSS(\Sigma_{tst}|\Sigma_{trn}, \omega) = SS(\Sigma_{tst}, \omega) \cap SS(\Sigma_{trn}, \omega)$$
$$TPSS(\Sigma_{int}|\Sigma_{trn}, \omega) = SS(\Sigma_{int}, \omega) - SS(\Sigma_{trn}, \omega)$$
$$FNSS(\Sigma_{int}|\Sigma_{trn}, \omega) = SS(\Sigma_{int}, \omega) \cap SS(\Sigma_{trn}, \omega)$$

Furthermore,

$$FRGN(\Sigma_{int}|\Sigma_{trn}) = \bigcup_{\omega=1}^{+\infty} TPSS(\Sigma_{int}|\Sigma_{trn}, \omega)$$
$$SELF(\Sigma_{int}|\Sigma_{trn}) = \bigcup_{\omega=1}^{+\infty} FNSS(\Sigma_{int}|\Sigma_{trn}, \omega)$$
$$FRGN(\Sigma_{tst}|\Sigma_{trn}) = \bigcup_{\omega=1}^{+\infty} FPSS(\Sigma_{tst}|\Sigma_{trn}, \omega)$$
$$SELF(\Sigma_{tst}|\Sigma_{trn}) = \bigcup_{\omega=1}^{+\infty} TNSS(\Sigma_{tst}|\Sigma_{trn}, \omega)$$

Next, according to the sequences in these four categories, we will define two
aspects of stide performance, namely effectiveness and completeness. Finally,
we will give the definitions and conditions for an efficient stide detector.

4.1.1 Effectiveness of a stide detector

Definition 3 (Effectiveness) A stide detector with detector window $\omega$ is
effective to detect the intrusion in $\Sigma_{int}$ if there is at least one sequence in the
intrusive sequence set $SS(\Sigma_{int}, \omega)$ that is detected as a true positive, i.e.,
$TPSS(\Sigma_{int}|\Sigma_{trn}, \omega) \neq \emptyset$.

To detect an intrusion effectively, the relation between the stide detector window
size $\omega$ and the intrusion characteristics is critical for choosing a proper $\omega$ for stide,
and it is stated in the following theorem.

Theorem 2 Let us assume that there are a training dataset $\Sigma_{trn}$ and an
intrusive dataset $\Sigma_{int}$ of a process. A stide detector of length $\omega$, built from $\Sigma_{trn}$, is
effective w.r.t. $\Sigma_{int}$, iff

$$\omega \geq |MFS|_{\min}(\Sigma_{int}|\Sigma_{trn}) \qquad (7)$$

Example 5 Suppose that $\Sigma_{trn} = aba$, and $\Sigma_{int} = ababa$. Then, $MFS_{\min}(\Sigma_{int}|\Sigma_{trn}) = \{bab\}$,
and $|MFS|_{\min}(\Sigma_{int}|\Sigma_{trn}) = 3$. As $SS(\Sigma_{int}, 1) = SS(\Sigma_{trn}, 1) = \{a, b\}$, and
$SS(\Sigma_{int}, 2) = SS(\Sigma_{trn}, 2) = \{ab, ba\}$, $TPSS(\Sigma_{int}|\Sigma_{trn}, 1) = \emptyset$ and
$TPSS(\Sigma_{int}|\Sigma_{trn}, 2) = \emptyset$. But $TPSS(\Sigma_{int}|\Sigma_{trn}, 3) = \{bab\} \neq \emptyset$. Thus, only if $\omega \geq 3$,
the intrusion in $\Sigma_{int}$ will be detected by stide effectively.
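The four sequence sets and the effectiveness condition can be reproduced on Example 5 with a small Python sketch. It assumes single-character events and, as an additional assumption not stated in the text, restricts the check to windows no longer than the intrusive trace, since a longer window yields no sequences at all. The function names are hypothetical.

```python
def ss(trace: str, w: int) -> set[str]:
    """SS(Sigma, w) for w >= 1: all contiguous windows of length w."""
    return {trace[i:i + w] for i in range(len(trace) - w + 1)}


def tpss(intr: str, trn: str, w: int) -> set[str]:
    return ss(intr, w) - ss(trn, w)   # intrusive sequences that raise alarms


def fnss(intr: str, trn: str, w: int) -> set[str]:
    return ss(intr, w) & ss(trn, w)   # intrusive sequences that are missed


def fpss(tst: str, trn: str, w: int) -> set[str]:
    return ss(tst, w) - ss(trn, w)    # normal sequences that raise alarms


def tnss(tst: str, trn: str, w: int) -> set[str]:
    return ss(tst, w) & ss(trn, w)    # normal sequences accepted as normal


def is_effective(intr: str, trn: str, w: int) -> bool:
    """Definition 3: at least one true positive sequence."""
    return bool(tpss(intr, trn, w))


def mfs_min_len(intr: str, trn: str) -> float:
    """|MFS|_min: the shortest foreign sequence is always a minimum one."""
    for w in range(1, len(intr) + 1):
        if tpss(intr, trn, w):
            return w
    return float("inf")


trn, intr = "aba", "ababa"
bound = mfs_min_len(intr, trn)
for w in range(1, len(intr) + 1):
    print(w, is_effective(intr, trn, w), w >= bound)  # the two columns agree
```

Because `mfs_min_len` is found by scanning window sizes, the loop mainly demonstrates the monotonicity behind Theorem 2: once a window size is effective, every larger feasible size is effective too.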

Note that Theorem 2 merely summarizes the conclusion of Tan et al. [25], but 
in our framework, it is rather straightforward to prove its validity. 

4.1.2 Completeness of a stide detector 

Definition 4 (Completeness) A stide detector with detector window $\omega$ is
complete if the underlying normal model built from a training dataset $\Sigma_{trn}$ is
complete. In other words, $TNSS(\Sigma_{tst}|\Sigma_{trn}, \omega) = SS(\Sigma_{tst}, \omega)$,
and thus $FPSS(\Sigma_{tst}|\Sigma_{trn}, \omega) = \emptyset$.

Due to the base-rate fallacy [2], the completeness of a stide detector is also
critical for its application. The following theorem establishes the conditions for
the completeness of a stide detector in terms of the detector window size $\omega$.

Theorem 3 Let us assume that there are a training dataset $\Sigma_{trn}$ and a test
dataset $\Sigma_{tst}$ of a process. A stide detector of length $\omega$, built from $\Sigma_{trn}$, is
complete w.r.t. $\Sigma_{tst}$, iff

$$\omega \leq |MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn}) \qquad (8)$$

Example 6 Suppose that $\Sigma_{trn} = aba$, and $\Sigma_{tst} = baba$. Then, $MSS_{\min}(\Sigma_{tst}|\Sigma_{trn}) = \{ba, ab\}$,
and $|MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn}) = 2$. As $SS(\Sigma_{tst}, 1) = SS(\Sigma_{trn}, 1) = \{a, b\}$, and
$SS(\Sigma_{tst}, 2) = SS(\Sigma_{trn}, 2) = \{ab, ba\}$, $FPSS(\Sigma_{tst}|\Sigma_{trn}, 1) = \emptyset$, and
$FPSS(\Sigma_{tst}|\Sigma_{trn}, 2) = \emptyset$. On the other hand, $FPSS(\Sigma_{tst}|\Sigma_{trn}, 3) = \{bab\} \neq \emptyset$.
Thus, only if $\omega \leq 2$, the stide detector built from $\Sigma_{trn}$ is complete w.r.t. $\Sigma_{tst}$.

Corollary 1 For a training dataset $\Sigma_{trn}$ and a test dataset $\Sigma_{tst}$ of a process, if
$|MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn}) = 0$, there are no complete stide detectors built from $\Sigma_{trn}$
w.r.t. $\Sigma_{tst}$.
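The completeness condition can be checked on Example 6, and Corollary 1 on a second toy case, with the Python sketch below. It assumes single-character events and window sizes no longer than the test trace; the function names and the second toy dataset are hypothetical.

```python
def ss(trace: str, w: int) -> set[str]:
    """SS(Sigma, w) for w >= 1: all contiguous windows of length w."""
    return {trace[i:i + w] for i in range(len(trace) - w + 1)}


def ss_all(trace: str) -> set[str]:
    """SS(Sigma): every contiguous sequence of the trace, including phi."""
    n = len(trace)
    return {trace[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def is_complete(tst: str, trn: str, w: int) -> bool:
    """Definition 4: the false positive sequence set FPSS is empty."""
    return not (ss(tst, w) - ss(trn, w))


def mss_min_len(tst: str, trn: str) -> float:
    """|MSS|_min computed directly from the MSS definition."""
    foreign = ss_all(tst) - ss_all(trn)
    # Each MSS is a self 1-order subsequence of some foreign sequence.
    mss = {s for f in foreign for s in (f[1:], f[:-1]) if s not in foreign}
    return min((len(s) for s in mss), default=float("inf"))


trn, tst = "aba", "baba"
bound = mss_min_len(tst, trn)
for w in range(1, len(tst) + 1):
    print(w, is_complete(tst, trn, w), w <= bound)  # the two columns agree

# Corollary 1 on a toy case: the test event "c" never occurs in training,
# so |MSS|_min = 0 and even a window of size 1 produces a false positive.
print(mss_min_len("c", "ab"), is_complete("c", "ab", 1))
```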

4.1.3 Efficient stide detectors 

Definition 5 (Efficiency) A stide detector with detector window size $\omega$ is
efficient w.r.t. $\Sigma_{tst}$ and $\Sigma_{int}$ if it is effective to detect the intrusion in $\Sigma_{int}$, and
it is complete in detecting $\Sigma_{tst}$.

It is easy to conclude that an efficient stide detector will not produce any false
positives when analyzing $\Sigma_{tst}$, and it will produce true positives when analyzing
$\Sigma_{int}$. We are now in a position to state the condition for a stide detector to be
efficient, which is expressed by the following theorem.

Theorem 4 Given a training dataset $\Sigma_{trn}$, a test dataset $\Sigma_{tst}$, and an intrusive
dataset $\Sigma_{int}$, a stide detector with the detection window size $\omega$, obtained using
$\Sigma_{trn}$, is efficient w.r.t. $\Sigma_{tst}$ and $\Sigma_{int}$ iff

$$|MFS|_{\min}(\Sigma_{int}|\Sigma_{trn}) \leq \omega \leq |MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn}) \qquad (9)$$

Proof 1 It can be inferred from Theorems 2 and 3.

Example 7 Suppose that $\Sigma_{trn} = aba$, $\Sigma_{tst} = baba$, and $\Sigma_{int} = abc$. Then,
$MSS_{\min}(\Sigma_{tst}|\Sigma_{trn}) = \{ba, ab\}$, and $MFS(\Sigma_{int}|\Sigma_{trn}) = \{c\}$; thus
$|MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn}) = 2$ and $|MFS|_{\min}(\Sigma_{int}|\Sigma_{trn}) = 1$. For a stide detector
with length $\omega$ to be complete, $\omega \leq 2$, and to be effective, $\omega \geq 1$. Finally, we
can get $1 \leq \omega \leq 2$.
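The range of efficient window sizes in Theorem 4 can be computed for Example 7 with the following Python sketch. It assumes single-character events, obtains $|MSS|_{\min}$ through the relation of Theorem 1 rather than from the MSS definition, and cross-checks the result against Definition 5 by brute force. The function names are hypothetical.

```python
def ss(trace: str, w: int) -> set[str]:
    """SS(Sigma, w) for w >= 1: all contiguous windows of length w."""
    return {trace[i:i + w] for i in range(len(trace) - w + 1)}


def shortest_foreign(tgt: str, ref: str) -> float:
    """Length of the shortest sequence of tgt absent from ref (+inf if none).

    The shortest foreign sequence is always a minimum foreign sequence,
    so this value equals |MFS|_min(tgt | ref).
    """
    for w in range(1, len(tgt) + 1):
        if ss(tgt, w) - ss(ref, w):
            return w
    return float("inf")


def efficient_range(trn: str, tst: str, intr: str):
    """Return (lo, hi) such that lo <= w <= hi is efficient, or None."""
    lo = shortest_foreign(intr, trn)      # Theorem 2: w >= |MFS|_min
    hi = shortest_foreign(tst, trn) - 1   # Theorem 1 gives |MSS|_min
    if lo == float("inf") or lo > hi:
        return None
    return lo, hi


def is_efficient(trn: str, tst: str, intr: str, w: int) -> bool:
    """Definition 5 checked directly: some true positive, no false positive."""
    return bool(ss(intr, w) - ss(trn, w)) and not (ss(tst, w) - ss(trn, w))


print(efficient_range("aba", "baba", "abc"))
print([w for w in range(1, 4) if is_efficient("aba", "baba", "abc", w)])
```

Returning `None` when the lower bound exceeds the upper bound corresponds to the case where no efficient stide detector exists for the given datasets.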



Figure 1: Efficient and inefficient areas for stide detectors on the training dataset
$\Sigma_{trn}$.

Figure 1 shows the area determined by $|MFS|_{\min}(\Sigma_{int}|\Sigma_{trn})$ and $|MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn})$
where efficient stide detectors must belong. If the point determined by $|MFS|_{\min}(\Sigma_{int}|\Sigma_{trn})$
and $|MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn})$ is in the efficient area, it is possible to find one or more
efficient stide detector(s). Otherwise, no efficient stide detector can be found.
Note that there is one undefined area since $|MFS|_{\min}(\Sigma_{int}|\Sigma_{trn}) \geq 1$.

The following corollary, drawn from the above theorem, explicitly defines 
the operational limits of a stide detector. 

Corollary 2 For a training dataset $\Sigma_{trn}$, a test dataset $\Sigma_{tst}$, and an intrusive
dataset $\Sigma_{int}$ of a process, the following hold:

(a). If $|MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn}) < |MFS|_{\min}(\Sigma_{int}|\Sigma_{trn})$, there are no efficient
stide detectors w.r.t. $\Sigma_{tst}$ and $\Sigma_{int}$.

(b). With a detector window $\omega$, if $\omega \geq |MFS|_{\min}(\Sigma_{int}|\Sigma_{trn})$ and $\omega > |MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn})$,
the stide detector built by $\Sigma_{trn}$ is effective, but not efficient w.r.t. $\Sigma_{tst}$ and
$\Sigma_{int}$.

(c). With a detector window $\omega$, if $\omega < |MFS|_{\min}(\Sigma_{int}|\Sigma_{trn})$ and $\omega \leq |MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn})$,
the stide detector built by $\Sigma_{trn}$ is complete, but not efficient w.r.t. $\Sigma_{tst}$
and $\Sigma_{int}$.

4.2 Completeness of the training dataset vs. stide efficiency

From their definitions, $MFS(\Sigma_{int}|\Sigma_{trn})$ and $MSS(\Sigma_{tst}|\Sigma_{trn})$ will be affected
by the completeness of the training dataset to a large extent. Therefore, according
to Theorem 4, the completeness of the training dataset is critical to stide
efficiency. In Figure 2, the universe of all possible sequences is referred to as
U, in which the known and unknown normal models are only complementary
parts of the complete normal model. Outside the complete normal model is the
intrusive behavior space for the known and unknown intrusions. However, in
stide, all the sequences from the training dataset are regarded as normal (in the
known normal model), and other sequences lying outside the training dataset
are considered anomalous. Obviously, the unknown normal model is critical for
its efficiency. Thus, in our following analysis, we assume that the test dataset
$\Sigma_{tst}$ incorporates the whole unknown normal model (Figure 2). Even though
the assumption cannot be achieved in a real deployment, it is reasonable in
analyzing stide efficiency. Given this framework, let's examine the scenario in
which stide is suitable for detecting the intrusions into a resource.

Figure 2: Behavior spaces for stide: the known and unknown normal behavior
models, and the intrusive behavior space. (Regions labeled in the figure: Space for
Anomalous/Intrusive Behaviors; Unknown Normal Model (from Any Test Dataset);
Known Normal Model from Training Dataset.)

4.2.1 MSSs in the test dataset 

Based on Equation 4, $MSS(\Sigma_{tst}|\Sigma_{trn})$ is deduced as:

$$MSS(\Sigma_{tst}|\Sigma_{trn}) = \{S \mid \forall S (S \in SELF(\Sigma_{tst}|\Sigma_{trn})) \wedge (\exists S' (S' \in SS(\Sigma_{tst})) \wedge (S' \succeq_1 S) \wedge (S' \notin SELF(\Sigma_{tst}|\Sigma_{trn})))\}$$
$$= \{S \mid \forall S (S \in SELF(\Sigma_{tst}|\Sigma_{trn})) \wedge (\exists S' (S' \in SS(\Sigma_{tst})) \wedge (S' \succeq_1 S) \wedge (S' \in FRGN(\Sigma_{tst}|\Sigma_{trn})))\}$$

Thus, $MSS(\Sigma_{tst}|\Sigma_{trn})$ is affected by $FRGN(\Sigma_{tst}|\Sigma_{trn})$, which is in the
unknown normal model (Figure 2). Theoretically, for any $\Sigma_{trn}$ and $\Sigma_{tst}$, if $\Sigma_{trn}$ is
not complete, $|MSS|_{\min}(\Sigma_{tst}|\Sigma_{trn})$ can vary from 0 to $+\infty$ (Note: $+\infty$ here indicates
a potentially large number bounded by the length of the dataset $\Sigma_{tst}$).



4.2.2 MFSs in the intrusive dataset 

Let's first define one more concept in our framework:

Definition 6 The common false positive sequence set in the intrusive dataset, $CFPS(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn})$, is:

$$CFPS(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) = FRGN(\Sigma_{tst}|\Sigma_{trn}) \cap SS(\Sigma_{int})$$

It is obvious that $CFPS(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) \subseteq SS(\Sigma_{int})$.

Example 8 Suppose that $\Sigma_{trn} = ljk$, $\Sigma_{tst} = jkl$ and $\Sigma_{int} = ckl$. Based on the sequence set definition, $SS(\Sigma_{trn}) = \{\phi, l, j, k, lj, jk, ljk\}$, $SS(\Sigma_{tst}) = \{\phi, j, k, l, jk, kl, jkl\}$, and $SS(\Sigma_{int}) = \{\phi, c, k, l, ck, kl, ckl\}$. Next, we get $FRGN(\Sigma_{tst}|\Sigma_{trn}) = \{kl, jkl\}$. Therefore, $CFPS(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) = \{kl\}$ and $|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) = 2$.

The following theorem is deduced to determine what affects $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn})$ in our framework.

Theorem 5 For the datasets $\Sigma_{trn}$, $\Sigma_{tst}$ and $\Sigma_{int}$,

$$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = \min\big(|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}),\; |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})\big) \qquad (10)$$

Example 9 Take the same scenario as in Example 8. In it, $MFS_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) = \{c\}$, and $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) = 1$. Thus, we can determine $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = \min\{2, 1\} = 1$, which is correct as $MFS_{min}(\Sigma_{int}|\Sigma_{trn}) = \{c, kl\}$. On the other hand, if $\Sigma_{int} = jkl$, $SS(\Sigma_{int}) = SS(\Sigma_{tst})$. Thus, $CFPS(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) = \{kl, jkl\}$ and $MFS_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) = \phi$. As $MFS_{min}(\Sigma_{int}|\Sigma_{trn}) = \{kl\}$, $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = 2 = \min(2, +\infty)$.

As indicated by the intrusive space in Figure 2, $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$ reflects the intrusion characteristics in a specific intrusive dataset $\Sigma_{int}$, which does not depend on the completeness of the training dataset. Thus, regardless of the completeness of $\Sigma_{trn}$, $MFS(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$ and $SS(\Sigma_{int})$ will always be stable. According to Theorem 5, if $|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) < |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$, $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn})$ will be affected through the set $FRGN(\Sigma_{tst}|\Sigma_{trn})$ as the completeness of the training dataset increases.

In summary, both $MFS(\Sigma_{int}|\Sigma_{trn})$ and $MSS(\Sigma_{tst}|\Sigma_{trn})$ are affected by the completeness of the training dataset $\Sigma_{trn}$, i.e., by $FRGN(\Sigma_{tst}|\Sigma_{trn})$.
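The relation in Theorem 5 can be checked mechanically on the toy datasets of Examples 8 and 9. The following is a minimal Python sketch, not the authors' implementation. It assumes that SS is the set of contiguous subsequences of a trace (including the empty one), that a minimum foreign sequence is a foreign sequence whose proper contiguous subsequences are all self, and that the self model of $\Sigma_{trn} \oplus \Sigma_{tst}$ is the union of the two sequence sets. All function names are hypothetical.

```python
import math

def subsequences(trace):
    """SS: all contiguous subsequences of a trace, including the empty one."""
    n = len(trace)
    return {trace[a:b] for a in range(n + 1) for b in range(a, n + 1)}

def foreign(target, self_set):
    """Contiguous subsequences of target that are absent from the self set."""
    return subsequences(target) - self_set

def minimal_foreign(target, self_set):
    """Foreign sequences whose proper contiguous subsequences are all self.

    Checking the two longest proper subsequences is enough because any
    sequence containing a foreign sequence is itself foreign.
    """
    frgn = foreign(target, self_set)
    return {s for s in frgn if s[1:] not in frgn and s[:-1] not in frgn}

def min_len(seqs):
    """Shortest length in a set of sequences; +infinity for the empty set."""
    return min((len(s) for s in seqs), default=math.inf)

def theorem5_sides(trn, tst, intr):
    """Return both sides of Equation (10) for single-trace datasets."""
    self_trn = subsequences(trn)
    self_all = self_trn | subsequences(tst)  # assumed meaning of trn (+) tst
    cfps = foreign(tst, self_trn) & subsequences(intr)
    lhs = min_len(minimal_foreign(intr, self_trn))
    rhs = min(min_len(cfps), min_len(minimal_foreign(intr, self_all)))
    return lhs, rhs

if __name__ == "__main__":
    for intr in ("ckl", "jkl"):
        print(intr, theorem5_sides("ljk", "jkl", intr))
```

Under these assumptions both intrusive traces used in Example 9 give equal values on the two sides of Equation (10).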
4.2.3 Enhancing efficiency of a stide detector

Theorem 6 Assume that, for a process, the training dataset from which the known normal model is built is $\Sigma_{trn}$, the test dataset from which the whole unknown normal model is built is $\Sigma_{tst}$, and the intrusive dataset is $\Sigma_{int}$. Then, there exist one or more efficient stide detectors iff

$$|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \ge |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) \qquad (11)$$

Example 10 Take the two scenarios in Examples 8 and 9. If $\Sigma_{int} = ckl$, $MSS_{min}(\Sigma_{tst}|\Sigma_{trn}) = \{l, k\}$, $MFS_{min}(\Sigma_{int}|\Sigma_{trn}) = \{c\}$, and $MFS_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) = \{c\}$. Thus, there exists only one efficient stide detector, with length $w = 1$. However, if $\Sigma_{int} = jkl$, $MSS_{min}(\Sigma_{tst}|\Sigma_{trn}) = \{l, k\}$ and $MFS_{min}(\Sigma_{int}|\Sigma_{trn}) = \{kl\}$, and thus there do not exist efficient stide detectors. At the same time, as $MFS_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) = \phi$ and $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) = +\infty$, the above equation in Theorem 6 does not hold.

What the above theorem tells us is that, with increasing completeness of the training dataset, the intrusion characteristics reflected by $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$ act more and more as a threshold for the efficiency of a stide detector. Therefore, sooner or later, the intrusion must manifest itself in the intrusive dataset with a finite (reasonably small) length of the MFSs; otherwise, there will be no efficient stide detector for the intrusion.
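The two scenarios of Example 10 can be replayed with a direct simulation of stide. The sketch below is illustrative only: it assumes that a detector of window length w is "efficient" when it raises no mismatch on the test dataset and at least one mismatch on the intrusive dataset, and the helper names are hypothetical.

```python
def ngrams(trace, w):
    """All length-w windows of a trace (the stide normal model for window w)."""
    return {trace[i:i + w] for i in range(len(trace) - w + 1)}

def mismatches(trace, model, w):
    """Number of length-w windows of trace that are absent from the model."""
    return sum(trace[i:i + w] not in model for i in range(len(trace) - w + 1))

def efficient_windows(trn, tst, intr, max_w):
    """Window lengths that flag intr but raise no alarm on tst (assumed criterion)."""
    good = []
    for w in range(1, max_w + 1):
        model = ngrams(trn, w)
        no_false_alarm = mismatches(tst, model, w) == 0
        detects = mismatches(intr, model, w) > 0
        if no_false_alarm and detects:
            good.append(w)
    return good

if __name__ == "__main__":
    print(efficient_windows("ljk", "jkl", "ckl", 3))
    print(efficient_windows("ljk", "jkl", "jkl", 3))
```

With $\Sigma_{trn} = ljk$ and $\Sigma_{tst} = jkl$, only the window of length 1 qualifies for $\Sigma_{int} = ckl$, and no window qualifies for $\Sigma_{int} = jkl$, which agrees with the example.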

The following corollary, which follows from Theorems 6 and 4, emphasizes the condition for building efficient stide detectors from a training dataset:

Corollary 3 Assume that, for a process, the training dataset from which the known normal model is built is $\Sigma_{trn}$, the test dataset from which the unknown normal model is constructed is $\Sigma_{tst}$, and the intrusive dataset is $\Sigma_{int}$. Then, if $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) < |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$, there are no efficient stide detectors.

Ideally, if the training dataset $\Sigma_{trn}$ is complete so that it includes all the normal behaviors of a process (i.e., for any $\Sigma_{tst}$, $|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) = +\infty$), then $|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) = +\infty$. At the same time, the MFSs in an intrusive dataset $\Sigma_{int}$ are in fact $MFS(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$, which is the absolutely ideal scenario. Under the ideal scenario, there will definitely be efficient stide detectors trained by the training dataset.

4.3 Interpretation of related work on stide 

Following the publication of stide, several research studies have been published 
with criticisms and suggestions of improvement of stide [27] [26] [28] [29] . Under 
our proposed framework, they can be interpreted in a logical way to determine 
their basic foundations. 
4.3.1 Mimicry attacks and intrusion information hiding

Mimicry attacks were proposed by Wagner to show the weakness of the stide technique [27]. In a nutshell, the proposed strategies for mounting mimicry attacks are: avoiding intrusive behavior; waiting, passively or actively, until the intrusive behavior is accepted by the normal model; replacing the system call parameters; inserting no-effect system calls; and creating equivalent variations of a given malicious sequence. Following almost the same principles, the information hiding paradigm has also been applied to show that the stide technique is easy to evade [26].

Under our proposed framework, it is clear that all these evasion strategies aim to make the minimum foreign sequences of an intrusion as long as possible, so that they eventually grow beyond the length of the stide detector window and render the stide detector ineffective. From the viewpoint of information theory, the information gain in the intrusive dataset (which is manipulated by mimicry attacks or information hiding techniques) is too small to be detected. Furthermore, since these two techniques focus on applying stide to system call sequences in host-based systems under more or less strong assumptions, the only conclusion that can be drawn is that stide is not suitable for detecting mimicry attacks based on system call sequences. Therefore, at this point, no conclusion can be drawn about the influence of these techniques on the efficiency of a general stide detector (with a 'good' encoding replacing 'system call'). Moreover, according to our experimental results presented later, the large number of minimum foreign sequences discovered in every intrusive dataset will, to a large extent, discourage mimicry attacks and intrusion information hiding techniques, because all of the minimum foreign sequences must be mimicked or hidden to evade detection.

In addition, we found that it is possible to make stide inefficient by gaining control of one application and then nudging it to generate smaller minimum common false positives during the run of an intrusion (Theorem 5). If the number of false positives is large enough during the intrusion, the stide detector will be useless in detecting the intrusion due to the base-rate fallacy [2].

4.3.2 t-stide and variable length patterns 

As variations of stide, t-stide and variable-length patterns are proposed in [28] and [29], and both of them utilize the frequency information of each sequence. t-stide is very similar to stide except that it discards infrequent sequences whose frequency is smaller than a threshold t [28]. The performance of t-stide was found to be unsatisfactory by its authors. Under our framework, the obvious reason for t-stide's failure is that the discarded sequences increase the incompleteness of the training dataset, which decreases its detection efficiency. Furthermore, with the method of minimum foreign sequence discovery discussed later in Section 5.4, the negative conclusion regarding t-stide can be further explained by comparing the MFSs in the intrusive dataset for stide and t-stide.

Since the principles of stide and variable-length patterns [29] are different, their comparison will be based on the detection performance, considering the fact that only the patterns (or sequences) are used in the detection phase. As indicated in our framework, the minimum foreign sequence is the main characteristic left in the audit trails for the sequence-related AID techniques. In the principles of variable-length patterns, we noticed that if the minimum length of the MFSs of an intrusion is larger than 1, the intrusion will easily be ignored by variable-length patterns. For example, suppose that the normal model for the variable-length method is {ABCD, CAE, FBD}. If the minimum foreign sequence of an intrusion is 'DC', and the intrusive audit trail is 'ABCDCAEFBD', the intrusion will not be detected. As the experimental results in [29] are very good, we suspect that the minimum MFS length of the chosen intrusions is 1, just like the 'misconfiguration' for 'wu-ftpd' in our experiments described later.
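The 'DC' example can be reproduced with a small sketch. It uses a simplified stand-in for the variable-length pattern matcher of [29], not the actual algorithm: a trace is assumed to be accepted whenever it can be written as a concatenation of normal patterns. The function name is hypothetical.

```python
def can_segment(trace, patterns):
    """True if trace is a concatenation of patterns (dynamic programming)."""
    reachable = [True] + [False] * len(trace)  # reachable[k]: prefix of length k is covered
    for end in range(1, len(trace) + 1):
        reachable[end] = any(
            end >= len(p)
            and reachable[end - len(p)]
            and trace[end - len(p):end] == p
            for p in patterns
        )
    return reachable[len(trace)]

patterns = ["ABCD", "CAE", "FBD"]
trace = "ABCDCAEFBD"

accepted = can_segment(trace, patterns)          # the trail is fully covered by patterns
contains_dc = "DC" in trace                      # yet the sequence 'DC' occurs in it
dc_inside_pattern = any("DC" in p for p in patterns)

if __name__ == "__main__":
    print(accepted, contains_dc, dc_inside_pattern)
```

The trail splits cleanly into ABCD, CAE and FBD, so this matcher accepts it even though 'DC' only arises at the boundary between two patterns and appears inside none of them.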

4.3.3 The significance of locality frame count 

For stide, following [28] [9], the anomaly value of a trace is derived from the 
number of mismatches occurring in a temporally local region, called a local- 
ity frame (LF). Then, a locality frame count (LFC) is used as a threshold to 
determine whether the locality frame is anomalous in the trace. 

Let us assume that the stide detector window is of length $w$. Suppose that, in a locality frame of a trace, there are at least $n$ MFSs, $MFS_1, MFS_2, \ldots, MFS_n$, with lengths smaller than $w$, and that $MFS_k$ has the minimum length $l_k$ among them. Since, by definition, one MFS cannot completely include another MFS, the minimum number of mismatches occurs when the MFSs are maximally overlapped, i.e., $MFS_2$ starts one event later than $MFS_1$, $MFS_3$ starts one event later than $MFS_2$, and so on. In this pathological case, the minimum number of anomalies detected in the LF should be $w - l_k + n$. Therefore, for successful detection of the anomaly in the LF, we must have $LFC \le w - l_k + n$.

From the above analysis, we can identify these ways to successfully detect intrusions in an LF: (1) making the detector window larger; (2) making the minimum foreign sequence smaller; and (3) making the number of MFSs in one locality frame as large as possible to form a cluster of anomalies [9]. Since a larger detector window will degrade the efficiency of stide, the latter two options can serve as a guideline for choosing proper lengths for LF and LFC.
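As a small worked example of the bound above, the following sketch evaluates $w - l_k + n$ and compares it with a candidate LFC. It simply encodes the expression from the text and reads the detection condition as $LFC \le w - l_k + n$; the function names are hypothetical.

```python
def min_mismatches(w, mfs_lengths):
    """Lower bound w - l_k + n, where l_k is the shortest MFS length and n the MFS count."""
    if not mfs_lengths:
        raise ValueError("at least one MFS is required")
    if any(length >= w for length in mfs_lengths):
        raise ValueError("the analysis assumes every MFS is shorter than the window")
    return w - min(mfs_lengths) + len(mfs_lengths)

def lfc_detects(lfc, w, mfs_lengths):
    """True if the locality frame count threshold is reached even in the worst case."""
    return lfc <= min_mismatches(w, mfs_lengths)

if __name__ == "__main__":
    # Window of length 6 and three MFSs of lengths 2, 3 and 4 in one locality frame.
    print(min_mismatches(6, [2, 3, 4]))
    print(lfc_detects(7, 6, [2, 3, 4]), lfc_detects(8, 6, [2, 3, 4]))
```

For a window of length 6 and MFS lengths 2, 3 and 4, the bound is 6 - 2 + 3 = 7, so under this reading an LFC of 7 is still reached in the worst case while an LFC of 8 is not guaranteed to be.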

5 Applications 

Apart from strengthening the comprehension of the inherent dynamics of stide-like AID detectors, the formal framework can also be applied to accelerate the training of stide-like AID detectors with fewer training audit trails, to identify the precise context of an intrusion, and so on. Besides evaluating the influence of the completeness of the training dataset on stide efficiency, two of its applications will be described in detail: (1) trimming the normal dataset without losing efficiency for a given detection performance (thus the training procedure is sped up), and (2) identifying the intrusion contexts in the intrusive dataset.
5.1 Experimental setup and datasets



Table 2: The dataset specifications.

Normal Datasets    Intrusive Datasets    No. of Traces    No. of System Calls
live-named-UNM                           142              9230572
                   buffer overflow-1     3                969
                   buffer overflow-2     2                831
live-lpr-MIT                             2703             2926304
                   lprcp                 1001             165248
sendmail-CERT                            294              1576086
                   syslog-local-1        6                1516
                   syslog-local-2        6                1574
                   syslog-remote-1       7                1861
                   syslog-remote-2       4                1553
                   cert-sm565a           3                275
                   cert-sm5x             8                1537
sendmail-UNM                             346              1799764
                   decode                36               3067
                   forward loops         36               2569
                   sunsendmailcp         3                1119
syn-wu-ftpd                              8                180315
                   misconfiguration      5                1363
syn-xlock-UNM                            71               339177
                   buffer overflow-1     1                489
                   buffer overflow-2     1                460



For the convenience of comparison, the datasets [8] that are used in [28] [25] 
have been used in our experiments as well. In addition, we have discarded the 
normal datasets of several processes that are too small to use in our framework. 
The normal and intrusive datasets for selected processes are specified in Table 2. 
From the table, our selected datasets represent most processes and intrusions 
into the processes. Furthermore, to analyze the characteristics of every intru- 
sion, its intrusive dataset, even into the same process, is treated as an individual 
dataset. 

5.2 The influence of the completeness of training dataset 
on stide efficiency 

In this section, the influence of the completeness of the training dataset on the efficiency of stide detectors will be evaluated. For that purpose, we regard the normal dataset for every process as complete for the normal model of the process, and we induce incompleteness by splitting the normal dataset into a training dataset and a test dataset. To remove any dependency, we choose the training datasets with $m$ varying sizes $Size_1, \ldots, Size_m$ and $n$ varying starting points $Pos_1, \ldots, Pos_n$ within the length of the normal dataset $\Sigma_{nml}$. To achieve this, the normal dataset is treated as a continuous ring by wrapping around the linear dataset. Given any splitting point $Pos_i$ and any size $Size_j$, the part from $Pos_i$ to $(Pos_i + Size_j) \,\%\, |\Sigma_{nml}|$ is selected as $\Sigma_{trn}(i,j)$, and whatever remains is chosen as the test dataset $\Sigma_{tst}(i,j)$. Based on $\Sigma_{trn}(i,j)$ and $\Sigma_{tst}(i,j)$, the completeness of the training dataset will be evaluated with respect to stide efficiency.
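The wrap-around split described above can be sketched as follows. This is an illustrative helper, not the authors' tool: the name `ring_split` is hypothetical, positions and sizes are given in events rather than percentages, and a training part that crosses the end of the dataset is simply concatenated, which is a simplification.

```python
def ring_split(normal, pos, size):
    """Split a normal dataset treated as a ring.

    The `size` events starting at index `pos` (wrapping around the end)
    form the training part; the remaining events, in ring order, form
    the test part.
    """
    n = len(normal)
    if n == 0:
        raise ValueError("the normal dataset must not be empty")
    if not 0 <= size <= n:
        raise ValueError("size must be between 0 and the dataset length")
    start = pos % n
    ring = normal[start:] + normal[:start]  # rotate so that pos becomes index 0
    return ring[:size], ring[size:]

if __name__ == "__main__":
    train, test = ring_split("abcdefgh", pos=6, size=4)
    print(train, test)
```

For the eight-event dataset `abcdefgh`, a split at position 6 with size 4 selects `ghab` for training and leaves `cdef` for testing.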

On the other hand, in stide-like AID techniques, the frequency information of the events in the training dataset is not utilized, so trimming the repeated events (or sequences) in the training dataset is useful for economizing the training time without any loss of efficiency. In the trimming procedure, the critical sections in a normal dataset are identified to produce a compact training dataset. One of the requirements for a critical section is that the stide detectors trained by it must be as efficient as when they are trained by the complete untrimmed dataset. Finally, the most compact critical section in the normal dataset is chosen for stide without sacrificing its efficiency.

To achieve this, we also develop two graphical tools that make it easy and convenient to analyze the completeness of the training dataset and the characteristics of the datasets. They are described below.

5.2.1 MFS-MSS Average Curves 

These curves are inspired by Theorem 6, and can be used to depict graphically the influence of the completeness of the training dataset on the detection efficiency. At the same time, our objective for these curves is to evaluate the dynamics of stide efficiency with respect to the completeness of the training dataset; thus, we are only concerned with the size of the training dataset. For this reason, the average values of $|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$ and $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn})$ for a given training data size $Size_j$ ($1 \le j \le m$) are first calculated:

$$|MSS_{min}|_{avg}(j) = \frac{1}{n} \sum_{i=1}^{n} |MSS|_{min}(\Sigma_{tst}(i,j)|\Sigma_{trn}(i,j)) \qquad (12)$$

$$|MFS_{min}|_{avg}(j) = \frac{1}{n} \sum_{i=1}^{n} |MFS|_{min}(\Sigma_{int}|\Sigma_{trn}(i,j)) \qquad (13)$$

Then, we plot the average values of $|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$ and $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn})$ against the corresponding sizes of the dataset $\Sigma_{trn}$. We call the resulting graphs MFS-MSS Average Curves (MMAC) (Figure 3).

5.2.2 MFS-MSS Matrix

Let us first introduce a new concept, the 'critical section'. Within the context of stide, for a splitting point $Pos_i$, $Size_j$ is a critical section $CS(i, \lambda)$ if $|MSS|_{min}(\Sigma_{tst}(i,j)|\Sigma_{trn}(i,j)) \ge \lambda$ but $|MSS|_{min}(\Sigma_{tst}(i,j-1)|\Sigma_{trn}(i,j-1)) < \lambda$. Obviously, the critical section is indispensable for providing stide detectors with the detection performance $\lambda$. Other than the critical section $CS(i, \lambda)$, the remaining part of the normal dataset can be discarded, as it has a negligible effect on the stide detection efficiency.
The MFS-MSS Matrix (MMM) is defined in order to help identify the critical sections in the normal dataset with respect to the predefined detection performance $\lambda$. In the matrix, the columns (the horizontal axis) are defined by the splitting sizes of the training dataset $\{Size_1, Size_2, \ldots, Size_m\}$, and the rows (the vertical axis) are defined by the splitting points of the training dataset $\{Pos_1, Pos_2, \ldots, Pos_n\}$ (as in Figure 4). According to our proposed formal framework (especially Eqn (10) and Theorem 6), an entry $MMM(i,j)$ in an MMM matrix is labeled 'efficient' if $|MSS|_{min}(\Sigma_{tst}(i,j)|\Sigma_{trn}(i,j)) \ge \lambda$; otherwise, it is labeled 'inefficient'. Furthermore, for every specific pair of $Pos_i$ and $Size_j$, if $MMM(i,j)$ is inefficient but $MMM(i,j+1)$ is efficient, the transition from the inefficient entry $MMM(i,j)$ to the efficient entry $MMM(i,j+1)$ is called an efficiency transition in the MMM matrix. From the efficiency transition, it can be concluded that the section in $\Sigma_{nml}$ from $Pos_i$ to $(Pos_i + Size_{j+1}) \,\%\, |\Sigma_{nml}|$ (footnote 4) is critical for building efficient stide detectors, i.e., it is a critical section $CS(i, \lambda)$.

After identifying the critical sections for all splitting points, we choose the most compact critical section $MCCS(\lambda)$ ($= CS(i, \lambda)$) in the normal dataset as the training dataset for stide. As its name implies, for any other critical section $CS(k, \lambda)$ ($i \ne k$), $|MCCS(\lambda)| \le |CS(k, \lambda)|$. Since the redundant parts in the normal dataset can be trimmed by using $MCCS(\lambda)$, the training time for the stide detectors can be substantially reduced without sacrificing the detection performance. As an added benefit, the size of $MCCS(\lambda)$ in the normal dataset provides an intuitive measure of the complexity of a process. This is because, intuitively, with respect to the same detection performance, the more complex the process is, the larger $MCCS(\lambda)$ is. Furthermore, this technique for dataset trimming can be utilized in other domains as well, such as information retrieval and computer forensics.
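The labeling, efficiency-transition and selection steps can be sketched on a toy matrix. The values below are made up for illustration and are not experimental results; `mss_min[i][j]` is assumed to hold $|MSS|_{min}$ for splitting point $Pos_i$ and size $Size_j$, and the function names are hypothetical.

```python
def critical_sections(mss_min, sizes, lam):
    """For each splitting point, return the size at the first efficiency
    transition (inefficient entry followed by an efficient one), or None."""
    result = []
    for row in mss_min:
        found = None
        for j in range(1, len(row)):
            if row[j - 1] < lam <= row[j]:  # MMM(i, j-1) inefficient, MMM(i, j) efficient
                found = sizes[j]
                break
        result.append(found)
    return result

def most_compact(cs_sizes):
    """Return (row index, size) of the smallest critical section, or None."""
    candidates = [(size, i) for i, size in enumerate(cs_sizes) if size is not None]
    if not candidates:
        return None
    size, i = min(candidates)
    return i, size

if __name__ == "__main__":
    sizes = [1, 8, 15, 22]            # hypothetical training sizes (% of the normal dataset)
    mss_min = [
        [1, 3, 6, 7],                 # Pos_1: transition at size 15
        [2, 6, 6, 8],                 # Pos_2: transition at size 8
        [1, 2, 3, 4],                 # Pos_3: never efficient for lambda = 6
    ]
    cs = critical_sections(mss_min, sizes, lam=6)
    print(cs, most_compact(cs))
```

In this toy case the second splitting point yields the most compact critical section. A row whose first entry is already efficient has no transition under the definition above and is reported as `None` by this sketch.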



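Effect of the trimming scheme. As mentioned earlier, in order to be valid, any trimming of the training dataset must not lead to any loss in the efficiency of stide detectors. That our trimming procedure indeed satisfies this criterion is shown by Theorem 7.

Theorem 7 Let $\Sigma^{cs}_{trn}$ denote the critical section, and $\Sigma^{cs}_{tst}$ the remaining part in the normal dataset. Thus, $\Sigma^{cs}_{trn} \oplus \Sigma^{cs}_{tst} = \Sigma_{nml}$. The future normal dataset is denoted as $\Sigma_{new}$. Then, for all (known and unknown) intrusions with $|MFS|_{min}(\Sigma_{int}|\Sigma_{nml} \oplus \Sigma_{new}) \le \lambda$,

$$
\begin{aligned}
&|MSS|_{min}(\Sigma_{new}|\Sigma_{nml}) \ge |MFS|_{min}(\Sigma_{int}|\Sigma_{nml} \oplus \Sigma_{new}) \\
\Rightarrow\; &|MSS|_{min}(\Sigma^{cs}_{tst} \oplus \Sigma_{new}|\Sigma^{cs}_{trn}) \ge |MFS|_{min}(\Sigma_{int}|\Sigma_{nml} \oplus \Sigma_{new})
\end{aligned}
$$

Proof 2 In the trimming scheme, we assume that $|MSS|_{min}(\Sigma^{cs}_{tst}|\Sigma^{cs}_{trn}) = \lambda$ ($> 0$), i.e., the detection performance of the stide detector built by the critical section is stable at $\lambda$ to detect intrusions that cause MSSs smaller than $\lambda$. From its definition, $|MSS|_{min}(\Sigma^{cs}_{tst} \oplus \Sigma_{new}|\Sigma^{cs}_{trn})$ is affected by the foreign sequence(s) $S$ ($S \in MFS_{min}(\Sigma^{cs}_{tst} \oplus \Sigma_{new}|\Sigma^{cs}_{trn})$) under the following two scenarios:

CASE 1: $S \in SS(\Sigma^{cs}_{tst})$:

$$
\begin{aligned}
& S \in SS(\Sigma^{cs}_{tst}),\; S \notin SS(\Sigma^{cs}_{trn}) \\
\Rightarrow\; & |S| = \lambda + 1 \\
\Rightarrow\; & |MSS|_{min}(\Sigma^{cs}_{tst} \oplus \Sigma_{new}|\Sigma^{cs}_{trn}) = \lambda \\
\Rightarrow\; & |MSS|_{min}(\Sigma^{cs}_{tst} \oplus \Sigma_{new}|\Sigma^{cs}_{trn}) \ge |MFS|_{min}(\Sigma_{int}|\Sigma_{nml} \oplus \Sigma_{new})
\end{aligned}
$$

CASE 2: $S \notin SS(\Sigma^{cs}_{tst})$:

$$
\begin{aligned}
& S \notin SS(\Sigma^{cs}_{tst}),\; S \notin SS(\Sigma^{cs}_{trn}) \\
\Rightarrow\; & S \subset \Sigma_{new},\; S \not\subset \Sigma_{nml} \\
\Rightarrow\; & |MSS|_{min}(\Sigma_{new}|\Sigma_{nml}) = |S| - 1 \\
\Rightarrow\; & |MSS|_{min}(\Sigma_{new}|\Sigma_{nml}) = |MSS|_{min}(\Sigma^{cs}_{tst} \oplus \Sigma_{new}|\Sigma^{cs}_{trn}) \\
\Rightarrow\; & |MSS|_{min}(\Sigma^{cs}_{tst} \oplus \Sigma_{new}|\Sigma^{cs}_{trn}) \ge |MFS|_{min}(\Sigma_{int}|\Sigma_{nml} \oplus \Sigma_{new})
\end{aligned}
$$

Based on the results under these two scenarios, the theorem is proved.

4 It is done in a wrap-around fashion, following the normal dataset splitting policy in the same application.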

5.3 Experimental evaluations 

The splitting procedure in our application works as follows. The length of the training dataset $Size_j$ is varied from 1% to 99% of the normal dataset, with a step of 7%, and the remaining portion of the normal dataset is designated as the test dataset. The splitting position $Pos_i$ is also varied from 1% to 99% of the normal dataset, using wrap-around, with a step of 7%. Thus, $m = n = 15$. The maximum length for MSSs in any test dataset is kept fixed at $N = 25$ (as all the MSSs and MFSs obtained are well within this limit).
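The grid of splitting sizes and positions can be enumerated directly, which also confirms the count $m = n = 15$. The variable names in this short sketch are illustrative.

```python
# Percentages 1%, 8%, ..., 99% of the normal dataset, with a step of 7%.
sizes = list(range(1, 100, 7))
positions = list(range(1, 100, 7))

# Every (position, size) pair corresponds to one entry of the MFS-MSS matrix.
grid = [(pos, size) for pos in positions for size in sizes]

if __name__ == "__main__":
    print(len(sizes), len(positions), len(grid))
    print(positions)
```

Both lists contain 15 values, so the grid has 225 entries, and the twelfth splitting position is 78%.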

In our experiments, the following aspects of the framework will be evaluated: 

A) The influence of the completeness of the training dataset on the MFSs in 
the intrusive dataset; 

B) The influence of the completeness of the training dataset on the MSSs in 
the test dataset; 

C) The effectiveness of the trimming procedure, and the related graphical 
tools. 
5.3.1 Evaluating the completeness of the training dataset

In Figure 3, the MFS-MSS average curves for the processes are illustrated (see footnote 5). From these curves, the varying sensitivity of the stide detectors to the completeness of the training dataset is obvious. For the processes 'named', 'xlock' and 'lpr' from MIT, the efficiency of stide detectors is quite sensitive to the completeness of the training dataset. For the process 'sendmail' from CERT, efficient stide detectors can be obtained even with a small training dataset. At the same time, the minimum foreign sequences of the intrusive dataset are not affected much by the completeness of the training dataset, since $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$ is small (less than 2) for 'named', 'lpr' and 'sendmail' from CERT (Equation 10). However, if the MFS(s) of an intrusion is longer than 2, as for 'decode-280', the influence of the minimum common false positive sequences in the intrusive dataset can be observed clearly as the completeness of the normal dataset increases. It is worth noting that the answer to the 'Why 6?' [24] question is provided explicitly by Figure 3.d.

Generally speaking, the degree of sensitivity of the stide detectors to the completeness of the training dataset may be influenced by the complexity of the processes and/or the audit trail collection tools. If the function of a process is simple, a complete training dataset is easy to collect, and the detection efficiency is influenced more by the intrusion characteristics $|MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$ (Theorem 5). Otherwise, the completeness of the training dataset is hard to guarantee, and the stide detector is more sensitive to the completeness of the training dataset. In other words, stide is very efficient at detecting intrusions into a process with a simple function, but its efficiency deteriorates when it monitors a complex process. For instance, stide is not appropriate for detecting intrusions on the Internet, since Internet traffic behavior is very dynamic and even evolves with time.

5.3.2 Identifying critical sections using MMM 

In the MMM matrix, the efficiency of every entry is indicated by its darkness, which is defined by the value $|MSS|_{min}(\Sigma_{tst}(i,j)|\Sigma_{trn}(i,j)) - \lambda$. As in Figure 4, the darker elements of the matrix indicate the efficient entries, and the lighter elements indicate the inefficient entries. Therefore, based on Theorem 6, the darker an entry in the MMM matrix is, the more likely it is that efficient stide detectors can be trained from the training dataset determined by the entry $(i,j)$. Furthermore, for every splitting point, the efficiency transition is clearly visible as the transition from a lighter entry to the first darker one.

In our experiments, we let $\lambda = 6$. From these efficiency transitions in the MMM matrices (Figure 4), the critical sections in the normal dataset for any process can be identified easily for every splitting point. For example, for the process 'lpr', at the splitting point $Pos_{12} = 78\%$, $CS(12, 6) = [78\%, 100\%] \cup [0, 63\%]$. Finally, the most compact critical section is obtained for every process (e.g., for 'lpr', $MCCS(6) = [85\%, 100\%] \cup [0, 63\%]$). In our experiments, the MCCSs for various processes in the normal datasets are shown in Table 3. Obviously, the beginning part [0, 7%] and the end part [92%, 100%] are included in all these most compact critical sections. As discussed in [17], the beginning and end transactions of a process are critical in building the normal behavior model, and thus affect the stide efficiency.

Table 3: The most compact critical sections when $\lambda = 6$.

Process                  MCCS(λ) (%)            |MCCS(λ)|
'named' from UNM         [92, 100] ∪ [0, 84]    92% (8492126)
'lpr' from MIT           [85, 100] ∪ [0, 63]    78% (2282517)
'sendmail' from CERT     [92, 100] ∪ [0, 7]     15% (236412)
'sendmail' from UNM      [64, 100] ∪ [0, 14]    50% (899882)
'wu-ftpd' from UNM       [92, 100] ∪ [0, 28]    36% (64913)
'xlock' from UNM         [71, 100] ∪ [0, 70]    99% (335785)

5 For clarity, in this figure, we have grouped the intrusions that have the same MFS sequences as the completeness of the training dataset increases. For the same reason, some MFS sequences are organized in a table if their curves are too close to be distinguishable, such as the table in Figure 3.d.

From the critical section set for every normal dataset in the MMM matrix, the sensitivity of the efficiency of stide detectors to the constitution and completeness of the training dataset can be identified in finer detail. Furthermore, from the actual size (i.e., not the percentage) of $MCCS(\lambda)$ in the normal dataset of a process, we can get a rough indication of the complexity of the process. For example, from Table 3, we can infer the following order in terms of complexity among the various processes in the experimental datasets: "wu-ftpd < sendmail from CERT < xlock < sendmail from UNM < lpr from MIT < named".
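The complexity ordering can be recomputed from the figures reported in Tables 2 and 3. The sketch below only re-sorts the published numbers; the dictionary keys are informal labels chosen here for readability.

```python
# |MCCS(lambda)| in system calls, from Table 3 (lambda = 6).
mccs_size = {
    "named (UNM)": 8492126,
    "lpr (MIT)": 2282517,
    "sendmail (CERT)": 236412,
    "sendmail (UNM)": 899882,
    "wu-ftpd (UNM)": 64913,
    "xlock (UNM)": 335785,
}

# Total size of each normal dataset in system calls, from Table 2.
normal_size = {
    "named (UNM)": 9230572,
    "lpr (MIT)": 2926304,
    "sendmail (CERT)": 1576086,
    "sendmail (UNM)": 1799764,
    "wu-ftpd (UNM)": 180315,
    "xlock (UNM)": 339177,
}

# Processes ordered from least to most complex by absolute MCCS size.
complexity_order = sorted(mccs_size, key=mccs_size.get)

# Share of the normal dataset covered by the MCCS, rounded to whole percent.
mccs_percent = {p: round(100 * mccs_size[p] / normal_size[p]) for p in mccs_size}

if __name__ == "__main__":
    print(" < ".join(complexity_order))
    print(mccs_percent)
```

The sorted order matches the one stated in the text, and the rounded shares agree with the percentage column of Table 3.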

5.4 Identifying the intrusion context 

By identifying the context associated with each alarm generated by the stide 
detector, it is possible to separate true alarms from false alarms. This then can 
be useful in designing more accurate detectors, and reducing (or removing) the 
false alarms. 

5.4.1 Foreign sequences graphs 

To identify the intrusion context, the intrusive dataset can be processed in two ways: (1) splitting it into blocks and evaluating every block; (2) evaluating every event in the intrusive dataset. We use the second option because the splitting process has the potential to break the foreign sequences, producing spurious and misleading anomalous sequences. To evaluate the impact of every event $e_i$, we determine the Foreign Sequence Length of $e_i$, or $FSL(e_i)$, which is the length of the first precedent foreign sequence that ends with $e_i$ (Algorithm 1). We plot the values of $FSL(e_i)$ against the index $i$ of the event to generate a graph called the foreign sequence graph (FSG).

According to our proposed framework, the MFSs can be determined from the foreign sequence graph as follows. In practice, if an MFS in one intrusive process is $e_a e_{a+1} \cdots e_b$ ($b - a + 1 < N$), it will be identified by the lowest point in its foreign sequence graph, which is expressed as $FSL(e_b) = b - a + 1$, since $FSL(e_{b-1}) \ge b - a + 1$ and $FSL(e_{b+1}) \ge b - a + 1$. Thus, the precedent $b - a + 1$ events constitute the intrusion context, that is, the minimum foreign sequence.

Algorithm 1 Calculating the foreign sequence lengths for all the events in an intrusive dataset.

Require: The event sequence of the process e_1, e_2, ..., e_p; and the stide self model series stide_1, stide_2, ..., stide_N
for i = 1 to p do
    seq = NULL; FSL(e_i) = N + 1;
    for j = 0 to N - 1 do
        if i - j <= 0 break;
        Insert e_(i-j) to seq as the first element;
        if seq is-not-in stide_(j+1) then
            FSL(e_i) = length(seq); break;
        end if
    end for
end for
Output FSL(e_1), FSL(e_2), ..., FSL(e_p)
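Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: each model stide_k is taken to be the set of length-k windows observed in the training trace, and the names `build_models` and `foreign_sequence_lengths` are hypothetical.

```python
def build_models(training, n_max):
    """stide_1 .. stide_N: the sets of length-k windows seen in the training trace."""
    return {
        k: {tuple(training[i:i + k]) for i in range(len(training) - k + 1)}
        for k in range(1, n_max + 1)
    }

def foreign_sequence_lengths(events, models, n_max):
    """FSL(e_i) for every event; n_max + 1 means no foreign sequence ends at e_i."""
    fsl = []
    for i in range(len(events)):
        value = n_max + 1
        # Grow the sequence backwards from e_i, one event at a time.
        for length in range(1, min(n_max, i + 1) + 1):
            seq = tuple(events[i - length + 1:i + 1])
            if seq not in models[length]:
                value = length  # shortest foreign sequence ending at e_i
                break
        fsl.append(value)
    return fsl

if __name__ == "__main__":
    models = build_models("ljk", n_max=3)
    print(foreign_sequence_lengths("ckl", models, n_max=3))
    print(foreign_sequence_lengths("ljk", models, n_max=3))
```

On the toy data of Example 8 (training trace `ljk`, intrusive trace `ckl`, N = 3) the sketch yields FSL values 1, 2, 2, while the training trace itself gets the default value N + 1 everywhere.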

5.4.2 Minimum foreign sequences in FSGs 

According to the definition of minimum foreign sequences, the non-MFS foreign sequences are formed by augmenting minimum foreign sequences. For example, suppose that an MFS is $x_1 x_2 \cdots x_l$; a foreign sequence can then be constituted as follows: $y_1 \cdots y_m x_1 x_2 \cdots x_l y_{m+1} \cdots y_{m+n}$, where $m \ge 0$, $n \ge 0$, and $y_1 \cdots y_m$ and $y_{m+1} \cdots y_{m+n}$ are not foreign sequences. Since these two parts in the foreign sequence provide no additional information in comparison with the MFS in them [23], they will be filtered out before further analysis.

Fortunately, in the generation of the foreign sequence graph (Algorithm 1), the prefix sequence $y_1 \cdots y_m$ is filtered out automatically. Therefore, to collect the minimum foreign sequence, only the suffix sequence $y_{m+1} \cdots y_{m+n}$ needs to be eliminated. The method to filter out the suffix sequence is trivial: if $FSL(e_i) = FSL(e_{i-1}) + 1$, the foreign sequence identified by $FSL(e_i)$ will be filtered out, since it merely extends the foreign sequence identified by $FSL(e_{i-1})$ by one event.
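The suffix filter can be sketched as a single pass over the FSL values. The function name is hypothetical, and the input FSL list is assumed to come from Algorithm 1 with the default value N + 1 marking events at which no foreign sequence ends.

```python
def minimum_foreign_sequences(events, fsl, n_max):
    """Collect the sequences identified by FSL values, dropping suffix extensions."""
    mfs = []
    for i, length in enumerate(fsl):
        if length > n_max:
            continue  # no foreign sequence ends at this event
        if i > 0 and length == fsl[i - 1] + 1:
            continue  # extends the foreign sequence ending at the previous event
        mfs.append(events[i - length + 1:i + 1])
    return mfs

if __name__ == "__main__":
    # FSL values for the intrusive trace 'ckl' against the training trace 'ljk'.
    print(minimum_foreign_sequences("ckl", [1, 2, 2], n_max=3))
```

For the intrusive trace `ckl` with FSL values 1, 2, 2, the foreign sequence `ck` is dropped as an extension of `c`, leaving `c` and `kl`.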

As stated before, the false alarms in the intrusive dataset or in the test 
dataset can be identified and analyzed as well by the intrusion context identifi- 
cation scheme. In summary, the scheme will be useful to study the characteris- 
tics of the intrusions, to remove the false alarms in the detection phase, and to 
improve the efficiency of the AID detection techniques. However, every coin has 
two sides. The identified intrusion context can be utilized to design smarter in- 
trusions, such as the information hiding techniques [26] and the mimicry attacks 
[27]. 

5.5 Experimental evaluations 

In the experiments, the following aspects about the scheme will be evaluated: 
A) The effectiveness of the foreign sequence graph, and how to identify the
intrusion context;

B) The significance of the minimum foreign sequences in the intrusive dataset. 

5.5.1 Identifying the intrusion context from FSGs 

The following figures (Figure 5) summarize the foreign sequence graphs for every intrusion into the chosen processes. For the convenience of comparison, some FSG graphs are compressed into one subfigure, and their borders are split with vertical lines for different intrusive datasets. To easily identify the boundaries, we introduce dummy values of FSL = -4 between different intrusive datasets, and FSL = -1 between different processes in one dataset.

From these foreign sequence graphs, we can make the following observations: 

• Some intrusions cannot be detected in the first (beginning) stage by stide- 
like anomaly detectors since there are no foreign sequences in that stage, 
such as the buffer overflow into 'named'; yet some intrusions can be de- 
tected in the first stage, such as the 'lprcp' into 'lpr from MIT'; 

• Different runs of the same attack (intrusion) have almost the same foreign
sequence graphs (or intrusive characteristics), such as sunsendmailcp,
in which the three different runs (10763, 10801, 10814) have the same
foreign sequence graph;

• For different intrusions into one process, the foreign sequence graphs are 
not the same, and they are intrusion-specific. Therefore, to detect all 
intrusions into a resource, one specific detection strategy (such as the 
stide detector with defined length) is not enough; 

• For most of the intrusions, there are obvious precursors at the beginning
of the anomalous events, and some of them are manifested by foreign
sequences of larger length. This hints at the existence of a tradeoff
between the MTTA (Mean Time To Alarm) of anomaly detectors and their
efficiency (reflected by the length of sequences).

5.5.2 Minimum foreign sequences 

The minimum foreign sequences for 'decode' are listed below. From a look at
the list, the answer to the 'why 6?' problem is obvious (1:exit, 2:fork, 5:open,
6:close, 19:lseek, 95:connect, 112:vtrace).

• decode-280 

process 283: 2-95-6-6-95-5 

• decode-314 

process 317: 112-6, 6-19, 2-95-1, 2-95-6-6-95-5. 
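The link between the listed sequences and the window length 6 can be made explicit with a short computation over the listed data. The dictionary layout and variable names below are illustrative; the system call numbers and sequences are those given above.

```python
# System call numbers as given in the text.
SYSCALL = {1: "exit", 2: "fork", 5: "open", 6: "close",
           19: "lseek", 95: "connect", 112: "vtrace"}

# Minimum foreign sequences listed for the two 'decode' intrusive datasets.
mfs = {
    "decode-280": ["2-95-6-6-95-5"],
    "decode-314": ["112-6", "6-19", "2-95-1", "2-95-6-6-95-5"],
}

def length(seq):
    """Number of system calls in a dash-separated sequence."""
    return len(seq.split("-"))

def names(seq):
    """Translate a dash-separated sequence of call numbers into call names."""
    return [SYSCALL[int(number)] for number in seq.split("-")]

# Shortest MFS per dataset: the smallest window that can expose that run.
shortest = {run: min(length(s) for s in seqs) for run, seqs in mfs.items()}

# Smallest window length that exposes every listed run.
window_for_all = max(shortest.values())

if __name__ == "__main__":
    print(shortest, window_for_all)
    print(names("2-95-6-6-95-5"))
```

The only MFS of 'decode-280' has six system calls, so a window shorter than 6 cannot expose that run, whereas 'decode-314' already shows MFSs of length 2.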
From the total number of non-duplicated MFSs in every intrusive dataset (Figure 6), we know that (1) the detector window DW = 2 is enough to detect most of the intrusions; (2) the intrusion 'decode-280' leads to the magic number 6 for stide; and (3) if the detector window DW > 7, the efficiency of stide detectors will not improve as much as might be expected.

Table 4: Shared Minimum Foreign Sequences by the same intrusion into the same process.

Intrusion                     No. of MFSs of Each Run    No. of Shared MFSs
decode                        {2,5}                      2
buffer overflow into xlock    {68,68}                    68
buffer overflow into named    {59,38}                    33
sunsendmailcp                 {24,24,24}                 24
forward loops                 {33,11,30,33,4}
syslog-local                  {55,71}                    52
syslog-remote                 {78,42}                    42



In addition, we note that different runs of the same intrusion into one
resource share most of their minimum foreign sequences (Table 4). It is quite
notable that different runs of 'sunsendmailcp' have the same set of minimum
foreign sequences. This finding benefits research on anomaly-based intrusion
detection: the runs of one intrusion differ so little that there is no need to
design a specific (or ad-hoc) IDS for each run. At the same time, it
strengthens two assertions: that the foreign sequence graphs are
intrusion-specific, and that different runs of the same intrusion have almost
the same characteristics.

Also, the large number of minimum foreign sequences given in Table 4,
especially for the 'buffer overflow' intrusions, will greatly discourage mimicry
attacks [27] and the information hiding paradigm, because every one of these
MFSs must be mimicked to achieve a successful mimicry attack. In
addition, after a careful manual identification, the minimum foreign sequences 
can also be applied to construct the 'intrusion signatures' 1 to be applied in 
signature-based intrusion detection techniques. 

6 Conclusions and future work 

In this paper, a general framework is proposed to determine the operational 
limits of stide detectors. Tan and Maxion [25] in their attempt to solve the "Why 
six?" problem, identified the length of the minimum foreign sequence in the audit 
data as a lower bound for the length of stide detectors. Our work complements 
their effort by showing the effect of completeness of the normal model on stide's 
performance, and establishing an upper bound for the length of the detector. 
In addition to generalizing Tan and Maxion's results, this framework provides 
a formal ground for analyzing future stide-like AID detectors that are based 
on sequence analysis, by exploring the dynamics of the various factors affecting 
operational limits of stide, i.e. the false positives and true positives. Based
on the formal framework, the foundations of several works related to stide are
interpreted in a logical way.

The experiments we conducted not only validate our theoretical results but
also provide further insight by showing the inter-dependencies of the various
factors affecting stide's performance; in particular, the influence of the
completeness of the training dataset on stide's efficiency is evaluated. The
conclusion of the completeness evaluation is that stide is not appropriate for
dynamic scenarios, such as traffic behavior on the Internet. Two applications
of the framework are also designed to demonstrate its usefulness. One is the
trimming procedure for the normal dataset, in which the redundant parts of the
normal dataset are filtered out before further analysis. To this end, two
graphical tools are designed to identify the influence of completeness and the
most compact critical section: the MFS-MSS Average Curve (MMAC) and the
MFS-MSS Matrix (MMM).

From the MMAC curves, the influence of the completeness of the training
dataset on the MSSs in the test dataset and the MFSs in the intrusive dataset is
analyzed. The existence of the minimum common false positive sequences is
also confirmed in the MMAC curves. At a finer granularity, the MMM matrix is
utilized to find the most compact critical section within the normal dataset for a
specific detection performance A. The MMAC curves and the MMM matrix also
provide an intuitive indication of the complexity of the corresponding process.

In this framework, the questions related to the 'Why 6?' problem can be
answered clearly, such as the question in [25], 'to what extent can we establish a
link between detectable anomalies and intrusive behaviors?', whose answer lies
in Theorem 6. After analyzing the influence of the completeness of the training
dataset of a process on the efficiency of the stide detectors, we can determine
whether stide is appropriate for detecting any intrusion into that process.

The second application of the framework, which is first introduced here, is 
the intrusion context identification in an intrusive dataset using the foreign se- 
quence graphs. From the minimum foreign sequences of intrusions, the following 
findings, which will benefit the research on anomaly-based intrusion detection, 
are reported: 

1. Different runs of an intrusion have almost the same characteristics;

2. Different intrusions into one process will cause different anomalies in the
intrusive datasets;

3. Most of the intrusions have precursors, which are useful to provide short
MMTA;

4. Some intrusions cannot be detected in their first stage, when no anomalies
are caused;

5. There is a diminishing rate of return in terms of efficiency as the detector
window size increases.

Limitations of the Framework. However, while using the proposed frame- 
work, we should bear in mind certain limitations of the framework. These are 
briefly stated below. 
1. Like any AID technique, the framework is based on the assumption that
any anomaly in the intrusive dataset is an indication of an intrusion into 
the resource. Even though the assumption is reasonable under most cir- 
cumstances, it is possible that a non-malicious access that deviates from 
the normal behavior will be detected as an intrusion. On the other hand, 
if an intrusion can successfully mimic the normal behaviors, then no AID 
technique can detect such an intrusion [26]. 

2. The framework specifically deals with stide-like AID techniques, which
assume that intrusions are manifested in the sequences of system calls and
try to detect them by a systematic analysis of such sequences. Therefore
it may not be appropriate to apply it to all possible intrusions or detection
techniques [27] [26].

In our future work, the framework will be further evaluated on datasets
from different environments, e.g. networks and the Windows platform.
Then, the definitions of effective, complete and efficient anomaly-based
intrusion detectors will be generalized to other sequence-based AID techniques.
As a practical and promising method to analyze intrusion characteristics, the
mechanisms for intrusion context identification will also be extended to other
AID techniques in our further study.
Figure 3: MFS-MSS average curves for different processes. Panels: (a) Process
'named'; (b) Process 'lpr' from MIT; (c) Process 'sendmail' from CERT;
(d) Process 'sendmail' from UNM; (e) Process 'ftpd'; (f) Process 'xlock'.
Axis label: Completeness of Training Dataset (%).

Figure 4: MFS-MSS Matrix for different processes (A = 6).

Foreign sequence graphs (FSGs) for the intrusions. Panels: (a) the process
'named', intrusion: buffer overflow; (c) the process 'sendmail' from CERT,
intrusion: syslog; (d) the process 'sendmail' from CERT, intrusions: sm565a and
sm5x; (e) the process 'sendmail' from UNM, intrusion: decode; (f) the process
'sendmail' from UNM, intrusion: forward loops; (g) the process 'sendmail' from
UNM, intrusion: sunsendmailcp (3 processes); (h) the process 'ftpd', intrusion:
misconfiguration. Graph titles: the FSG for the intrusion 'buffer overflow'
into 'named'; 'lprcp' into 'lpr'; 'decode' into 'sendmail-UNM'; 'buffer
overflow' into 'xlock'.



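MFS Length  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25  Total
named-1     5 34 11  .  3  1  .  2  .  .  .  .  .  1  .  1  .  1  .  .  .  .  .  .  .     59
named-2     5 19  6  1  1  1  .  1  .  .  2  .  1  .  1  .  .  .  .  .  .  .  .  .  .     38
lpr-1      26 11  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .     37
sm-cert-1   3 20 13  5  1  4  2  .  .  1  .  .  1  1  .  .  2  2  .  .  .  .  .  .  .     55
sm-cert-2   3 27 20  5  2  3  4  .  .  1  .  .  1  1  .  .  2  2  .  .  .  .  .  .  .     71
sm-cert-3   3 30 20  5  2  3  4  2  .  1  1  .  2  1  .  .  2  2  .  .  .  .  .  .  .     78
sm-cert-4   2  9  9  4  1  3  4  2  .  1  1  .  2  .  .  .  2  2  .  .  .  .  .  .  .     42
sm-cert-5   .  8  2  2  .  2  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .     15
sm-cert-6   . 23 25 17  6  4  5  2  2  1  .  .  1  .  .  .  4  2  4  .  .  .  .  .  .      ?
sm-unm-1    .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .      2
sm-unm-2    .  2  1  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .      5
sm-unm-3    . 11  9  5  2  .  2  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  1  .  1     33
sm-unm-4    .  3  1  2  .  .  2  1  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .     11
sm-unm-5    . 11  9  2  2  .  1  .  .  .  1  .  .  .  .  .  .  .  .  1  .  .  1  .  1     30
sm-unm-6    . 10 10  5  2  .  2  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  1  .  1     33
sm-unm-7    .  3  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .      4
sm-unm-8    .  7  7  4  2  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  1  .  .  1     24
sm-unm-9    .  7  7  4  2  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  1  .  .  1     24
sm-unm-10   .  7  7  4  2  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  1  .  .  1     24
xlock       3 34 14  8  4  2  1  .  2  .  ?  ?  .  .  .  .  .  .  .  .  .  .  .  .  .     68

'.' marks an empty cell; '?' marks a value that is illegible in the source.
The row for ftpd-1 and the second xlock row are not recoverable.

named-1,2: buffer overflow-1,2
sm-cert-1..6: syslog-local-1,2, syslog-local-3,4, sm565a and sm5x
sm-unm-1..10: decode-280,314, forward loops-1,2,3,4,5, sunsendmailcp-10763,10801,10814
ftpd-1: misconfiguration
xlock-1,2: buffer overflow-1,2

Figure 6: The number of Minimum Foreign Sequences (non-duplicated) in
intrusive datasets.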

A Proofs of some Theorems

A.1 Proof of Theorem 1

Proof 3 Assume that $S \in MFS_{min}(\Sigma_{tgt}|\Sigma_{ref})$. Thus,
$|S| = |MFS|_{min}(\Sigma_{tgt}|\Sigma_{ref})$. From the definition of
$MFS(\Sigma_{tgt}|\Sigma_{ref})$, $S \in FRGN(\Sigma_{tgt}|\Sigma_{ref})$, and
there exists at least one 1-order subsequence
$S' \in SELF(\Sigma_{tgt}|\Sigma_{ref})$ of $S$, considering that
$FRGN(\Sigma_{tgt}|\Sigma_{ref}) \cup SELF(\Sigma_{tgt}|\Sigma_{ref}) = SS(\Sigma_{tgt})$.
Therefore,

$$(S' \in SELF(\Sigma_{tgt}|\Sigma_{ref})) \wedge (S \in SS(\Sigma_{tgt})) \wedge (S' \text{ is a 1-order subsequence of } S) \wedge (S \notin SELF(\Sigma_{tgt}|\Sigma_{ref})) = True,$$

and we can conclude that $S' \in MSS(\Sigma_{tgt}|\Sigma_{ref})$.

Then, we use proof by contradiction to show that
$S' \in MSS_{min}(\Sigma_{tgt}|\Sigma_{ref})$. If
$S' \notin MSS_{min}(\Sigma_{tgt}|\Sigma_{ref})$, there must be another sequence
$S'' \in MSS_{min}(\Sigma_{tgt}|\Sigma_{ref})$ with $|S''| < |S'|$. Following
the above deduction, we can determine that there is one
$S''' \in MFS(\Sigma_{tgt}|\Sigma_{ref})$ of which $S''$ is a 1-order
subsequence. Considering $|S| = |S'| + 1$ and $|S'''| = |S''| + 1$, we get
$|S'''| < |S|$, which contradicts $|S| = |MFS|_{min}(\Sigma_{tgt}|\Sigma_{ref})$
as $S''' \in MFS(\Sigma_{tgt}|\Sigma_{ref})$.

Considering that $S \in MFS_{min}(\Sigma_{tgt}|\Sigma_{ref})$,
$S' \in MSS_{min}(\Sigma_{tgt}|\Sigma_{ref})$, and $|S| = |S'| + 1$, the
following equation holds:

$$|MSS|_{min}(\Sigma_{tgt}|\Sigma_{ref}) = |MFS|_{min}(\Sigma_{tgt}|\Sigma_{ref}) - 1$$

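A.2 Proof of Theorem 2

Proof 4 From the definitions, we know that:

$$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = \min(\{l \mid l > 0;\ SS(\Sigma_{int}, l) - SS(\Sigma_{trn}, l) \neq \emptyset\})$$

(1) If $\omega \geq |MFS|_{min}(\Sigma_{int}|\Sigma_{trn})$, then,

$$SS(\Sigma_{int}, \omega) - SS(\Sigma_{trn}, \omega) \neq \emptyset \Rightarrow TPSS(\Sigma_{int}|\Sigma_{trn}, \omega) \neq \emptyset$$

Hence, the stide detector of length $\omega$ built from $\Sigma_{trn}$ is
effective w.r.t. $\Sigma_{int}$.

(2) If the stide detector of length $\omega$ built from $\Sigma_{trn}$ is
effective w.r.t. $\Sigma_{int}$, then,

$$TPSS(\Sigma_{int}|\Sigma_{trn}, \omega) \neq \emptyset \Rightarrow SS(\Sigma_{int}, \omega) - SS(\Sigma_{trn}, \omega) \neq \emptyset \Rightarrow \omega \geq |MFS|_{min}(\Sigma_{int}|\Sigma_{trn})$$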
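
The definition used in this proof translates almost literally into code. The
sketch below is an illustration only: it assumes that $SS(\Sigma, l)$ is the set
of contiguous windows of length $l$ in a trace, the traces are made-up toy data,
and the function names are hypothetical.

```python
def subsequences(trace, length):
    """SS(trace, length): the set of contiguous windows of the given length."""
    return {tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)}


def min_foreign_length(intrusive, training):
    """Smallest l > 0 with SS(intrusive, l) - SS(training, l) non-empty.
    Returns None when no window of the intrusive trace is foreign."""
    for length in range(1, len(intrusive) + 1):
        if subsequences(intrusive, length) - subsequences(training, length):
            return length
    return None


def is_effective(window, intrusive, training):
    """Effective: at least one window of the intrusive trace is foreign."""
    return bool(subsequences(intrusive, window) - subsequences(training, window))


training = [1, 2, 3, 1, 3, 2, 1]   # toy normal trace (made up)
intrusive = [3, 1, 2, 1, 3]        # toy intrusive trace (made up)

mfs_min = min_foreign_length(intrusive, training)
effective_windows = [w for w in range(1, len(intrusive) + 1)
                     if is_effective(w, intrusive, training)]
print(mfs_min)            # 3
print(effective_windows)  # [3, 4, 5]
```

In this toy case every window length from the minimum foreign length upwards is
effective and no shorter one is, which is the equivalence stated by the theorem.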

A.3 Proof of Theorem 3

Proof 5 From the definitions, we know that:

$$|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) = \max(\{l \mid l > 0;\ SS(\Sigma_{tst}, l) - SS(\Sigma_{trn}, l) = \emptyset\})$$

(1) If $\omega \leq |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$, then,

$$SS(\Sigma_{tst}, \omega) - SS(\Sigma_{trn}, \omega) = \emptyset \Rightarrow FPSS(\Sigma_{tst}|\Sigma_{trn}, \omega) = \emptyset$$

Hence, the stide detector of length $\omega$ built from $\Sigma_{trn}$ is
complete w.r.t. $\Sigma_{tst}$.

(2) If the stide detector of length $\omega$ built from $\Sigma_{trn}$ is
complete w.r.t. $\Sigma_{tst}$, then,

$$FPSS(\Sigma_{tst}|\Sigma_{trn}, \omega) = \emptyset \Rightarrow SS(\Sigma_{tst}, \omega) - SS(\Sigma_{trn}, \omega) = \emptyset \Rightarrow \omega \leq |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$$
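
The completeness bound can be illustrated in the same way. This is a sketch
under the assumption that $SS(\Sigma, l)$ is the set of contiguous windows of
length $l$ and that $FPSS$ is the set of test windows missing from the training
trace; the traces are made up and the function names are hypothetical.

```python
def subsequences(trace, length):
    """SS(trace, length): the set of contiguous windows of the given length."""
    return {tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)}


def false_positives(window, test, training):
    """FPSS: windows of the (normal) test trace that the training trace lacks."""
    return subsequences(test, window) - subsequences(training, window)


def max_self_length(test, training):
    """Largest l > 0 for which no test window of length l is foreign (0 if none).
    Stopping at the first failure is enough: once a window of length l is
    foreign, every longer window that contains it is foreign as well."""
    longest = 0
    for length in range(1, len(test) + 1):
        if false_positives(length, test, training):
            break
        longest = length
    return longest


training = [1, 2, 3, 1, 3, 2, 1]   # toy normal trace (made up)
test = [3, 2, 1, 2, 3]             # toy normal test trace (made up)

mss_min = max_self_length(test, training)
complete_windows = [w for w in range(1, len(test) + 1)
                    if not false_positives(w, test, training)]
print(mss_min)                                 # 2
print(complete_windows)                        # [1, 2]
print(false_positives(3, test, training))      # {(2, 1, 2)}
```

Here a detector window of 1 or 2 raises no false alarm on the test trace, while
a window of 3 already flags the normal window (2, 1, 2).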

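A.4 Proof of Theorem 5

Proof 6 From the MFS definition,

$$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = \min(\{l \mid l > 0;\ SS(\Sigma_{int}, l) - SS(\Sigma_{trn}, l) \neq \emptyset\})$$

The foreign sequence length vector for $\Sigma_{trn}$ and $\Sigma_{int}$ is

$$FSLV(\Sigma_{int}|\Sigma_{trn}) = \{l \mid l > 0;\ SS(\Sigma_{int}, l) - SS(\Sigma_{trn}, l) \neq \emptyset\}$$

For $\forall l \in FSLV(\Sigma_{int}|\Sigma_{trn})$,

$$
\begin{aligned}
& SS(\Sigma_{int}, l) - SS(\Sigma_{trn}, l) \neq \emptyset \\
\Leftrightarrow\ & \exists S\,(S \in (SS(\Sigma_{int}, l) - SS(\Sigma_{trn}, l))) \\
\Leftrightarrow\ & \exists S\,(S \in SS(\Sigma_{int}, l) \wedge S \notin SS(\Sigma_{trn}, l)) \\
\Leftrightarrow\ & \exists S\,(S \in SS(\Sigma_{int}, l) \wedge S \notin SS(\Sigma_{trn}, l) \wedge (S \in SS(\Sigma_{tst}, l) \vee S \notin SS(\Sigma_{tst}, l))) \\
\Leftrightarrow\ & \exists S\,((S \in SS(\Sigma_{int}, l) \wedge S \notin SS(\Sigma_{trn}, l) \wedge S \in SS(\Sigma_{tst}, l)) \vee (S \in SS(\Sigma_{int}, l) \wedge S \notin SS(\Sigma_{trn}, l) \wedge S \notin SS(\Sigma_{tst}, l))) \\
\Leftrightarrow\ & \exists S\,((S \in SS(\Sigma_{int}, l) \wedge S \notin SS(\Sigma_{trn}, l) \wedge S \in SS(\Sigma_{tst}, l)) \vee (S \in SS(\Sigma_{int}, l) \wedge S \notin (SS(\Sigma_{trn}, l) \cup SS(\Sigma_{tst}, l)))) \\
\Leftrightarrow\ & \exists S\,((S \in SS(\Sigma_{int}, l) \wedge S \in (SS(\Sigma_{tst}, l) - SS(\Sigma_{trn}, l))) \vee (S \in SS(\Sigma_{int}, l) \wedge S \notin SS(\Sigma_{trn} \oplus \Sigma_{tst}, l))) \\
\Leftrightarrow\ & \exists S\,((S \in SS(\Sigma_{int}, l) \wedge S \in FPSS(\Sigma_{tst}|\Sigma_{trn}, l)) \vee (S \in SS(\Sigma_{int}, l) \wedge S \notin SS(\Sigma_{trn} \oplus \Sigma_{tst}, l))) \\
\Leftrightarrow\ & \exists S\,(S \in (SS(\Sigma_{int}, l) \cap FPSS(\Sigma_{tst}|\Sigma_{trn}, l)) \vee S \in (SS(\Sigma_{int}, l) - SS(\Sigma_{trn} \oplus \Sigma_{tst}, l))) \\
\Leftrightarrow\ & \exists S\,(S \in ((SS(\Sigma_{int}, l) \cap FPSS(\Sigma_{tst}|\Sigma_{trn}, l)) \cup (SS(\Sigma_{int}, l) - SS(\Sigma_{trn} \oplus \Sigma_{tst}, l)))) \\
\Leftrightarrow\ & SS(\Sigma_{int}, l) \cap FPSS(\Sigma_{tst}|\Sigma_{trn}, l) \neq \emptyset \ \vee\ SS(\Sigma_{int}, l) - SS(\Sigma_{trn} \oplus \Sigma_{tst}, l) \neq \emptyset
\end{aligned}
$$

Hence,

$$FSLV(\Sigma_{int}|\Sigma_{trn}) = \{l \mid l > 0;\ SS(\Sigma_{int}, l) \cap FPSS(\Sigma_{tst}|\Sigma_{trn}, l) \neq \emptyset\} \cup \{l \mid l > 0;\ SS(\Sigma_{int}, l) - SS(\Sigma_{trn} \oplus \Sigma_{tst}, l) \neq \emptyset\}$$

Finally,

$$
\begin{aligned}
|MFS|_{min}(\Sigma_{int}|\Sigma_{trn})
&= \min(FSLV(\Sigma_{int}|\Sigma_{trn})) \\
&= \min(\{l \mid SS(\Sigma_{int}, l) \cap FPSS(\Sigma_{tst}|\Sigma_{trn}, l) \neq \emptyset\} \cup \{l \mid SS(\Sigma_{int}, l) - SS(\Sigma_{trn} \oplus \Sigma_{tst}, l) \neq \emptyset\}) \\
&= \min(\min(\{l \mid SS(\Sigma_{int}, l) \cap FPSS(\Sigma_{tst}|\Sigma_{trn}, l) \neq \emptyset\}),\ \min(\{l \mid SS(\Sigma_{int}, l) - SS(\Sigma_{trn} \oplus \Sigma_{tst}, l) \neq \emptyset\})) \\
&= \min(|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}),\ |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}))
\end{aligned}
$$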
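
The identity derived in this proof can be checked numerically on toy data. The
sketch below makes two assumptions that are not stated in this appendix:
$SS(\Sigma, l)$ is the set of contiguous windows of length $l$, and the window
set of the combined training and test data is the union of the two window sets
(as the derivation itself uses). All traces are made up and the helper names
are hypothetical.

```python
import math


def subsequences(trace, length):
    """SS(trace, length): the set of contiguous windows of the given length."""
    return {tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)}


def first_length(window_set, max_length):
    """Smallest l in 1..max_length for which window_set(l) is non-empty."""
    for length in range(1, max_length + 1):
        if window_set(length):
            return length
    return math.inf  # the minimum over an empty set of lengths


training = [1, 2, 3, 1, 3, 2, 1]   # toy normal trace (made up)
test = [3, 2, 1, 2, 3]             # toy normal test trace (made up)
intrusive = [2, 1, 2, 2]           # toy intrusive trace (made up)
n = len(intrusive)

# |MFS|_min(intrusive | training)
mfs_trn = first_length(
    lambda l: subsequences(intrusive, l) - subsequences(training, l), n)

# |CFPS|_min: intrusive windows that are also false positives of the test trace
cfps = first_length(
    lambda l: subsequences(intrusive, l)
    & (subsequences(test, l) - subsequences(training, l)), n)

# |MFS|_min(intrusive | training combined with test)
mfs_both = first_length(
    lambda l: subsequences(intrusive, l)
    - (subsequences(training, l) | subsequences(test, l)), n)

print(mfs_trn, cfps, mfs_both)  # 2 3 2
assert mfs_trn == min(cfps, mfs_both)
```

For these traces the shortest foreign window, (2, 2), is unknown to both normal
traces, so the second term of the minimum decides the result.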
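
A.5 Proof of Theorem 6

Proof 7 If $|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \geq 1$,

$$\forall l\,(1 \leq l \leq |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}),\ FPSS(\Sigma_{tst}|\Sigma_{trn}, l) = \emptyset) \qquad (14)$$

Furthermore, according to the definition of CFPS,

$$|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) > |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \qquad (15)$$

($\Leftarrow$) If
$|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \geq |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$,
there exist efficient stide detectors for datasets $\Sigma_{trn}$,
$\Sigma_{tst}$ and $\Sigma_{int}$.

$$
\begin{aligned}
& |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \geq |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) \\
\Rightarrow\ & |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \geq 1 \\
\Rightarrow\ & |CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) > |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \\
\Rightarrow\ & |CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) > |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) \\
\Rightarrow\ & |MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) \\
\Rightarrow\ & |MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) \leq |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})
\end{aligned}
$$

From Theorem 4, there exist efficient stide detectors under this scenario.

($\Rightarrow$) If there exist efficient stide detectors for $\Sigma_{trn}$,
$\Sigma_{tst}$ and $\Sigma_{int}$, then
$|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \geq |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$.

To apply the proof by contradiction, let us assume
$|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) < |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$.

(1) If $|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) = 0$,

From the MFS definition,
$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) \geq 1 > |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$.
Based on Theorem 4, there do not exist efficient stide detectors, which
contradicts our statement. So,
$|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) < |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$
is not correct.

(2) If $|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \geq 1$,

(2.a) If
$|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) \geq |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$,
from equation (10),
$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$.
Furthermore, we assume
$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst}) > |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$,
therefore

$$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) > |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$$

(2.b) If
$|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) < |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$,
from Eqn. (10),
$|MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) = |CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn})$,
and from Eqn. (15),

$$|CFPS|_{min}(\Sigma_{int}, \Sigma_{tst}|\Sigma_{trn}) > |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) \Rightarrow |MFS|_{min}(\Sigma_{int}|\Sigma_{trn}) > |MSS|_{min}(\Sigma_{tst}|\Sigma_{trn})$$

From (2.a), (2.b) and Theorem 4, there are no efficient stide detectors, which
contradicts the statement. Therefore,
$|MSS|_{min}(\Sigma_{tst}|\Sigma_{trn}) < |MFS|_{min}(\Sigma_{int}|\Sigma_{trn} \oplus \Sigma_{tst})$
is not correct.

Based on (1) and (2), the theorem is proved.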
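
The statement of the theorem can also be explored on toy data. The sketch below
assumes that a detector window is "efficient" when it is both effective (some
intrusive window is foreign) and complete (no test window is foreign), as the
proofs of Theorems 2 and 3 suggest; it also assumes that the combined training
and test data contribute the union of their window sets. The traces are made up
and all names are hypothetical.

```python
def subsequences(trace, length):
    """SS(trace, length): the set of contiguous windows of the given length."""
    return {tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)}


def foreign(target, reference, length):
    """Windows of the target trace that never occur in the reference trace."""
    return subsequences(target, length) - subsequences(reference, length)


training = [1, 2, 3, 1, 3, 2, 1]   # toy normal trace (made up)
test = [3, 2, 1, 2, 3]             # toy normal test trace (made up)
intrusive = [2, 1, 2, 2]           # toy intrusive trace (made up)

# Window lengths that are effective on `intrusive` and complete on `test`.
max_window = min(len(test), len(intrusive))
efficient = [w for w in range(1, max_window + 1)
             if foreign(intrusive, training, w) and not foreign(test, training, w)]

# |MSS|_min(test | training): longest window with no false positive.
mss_min = 0
for w in range(1, len(test) + 1):
    if foreign(test, training, w):
        break
    mss_min = w

# |MFS|_min(intrusive | training combined with test), window sets pooled.
mfs_pooled = next(
    (w for w in range(1, len(intrusive) + 1)
     if subsequences(intrusive, w) - subsequences(training, w) - subsequences(test, w)),
    None)

condition = mfs_pooled is not None and mss_min >= mfs_pooled
print(efficient, mss_min, mfs_pooled, condition)  # [2] 2 2 True
```

In this example the condition of the theorem holds and exactly one efficient
window length (2) exists; shortening the test trace's self sequences or
lengthening the intrusion's foreign sequences would empty the list.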